“I don’t like your mum”


VAR, eh?

So, does video-assisted refereeing (VAR) improve the quality of decision-making in football matches?

Of course, that’s not the only question about VAR: assuming there is an improvement, one has to ask whether it’s worth either the expense or the impact it has on the flow of games when an action is reviewed. But these are subjective questions, whereas the issue about improvements in decision-making is more objective, at least in principle. With this in mind, IFAB, the body responsible for determining the laws of football, have sponsored statistical research into the extent to which VAR improves the accuracy of refereeing decisions.

But before looking at that, it’s worth summarising how the VAR process works. VAR is limited to an evaluation of decisions made in respect of four types of events:

  • Goals
  • Penalties
  • Straight red cards
  • Mistaken identity in the award of cards

And there are two modes of operation of VAR:

  • Check mode
  • Review mode

The check mode runs in the background throughout the whole game, without initiation by the referee. All incidents of the above types are viewed and considered by the VAR, and those where there is a potential error are checked, with the assistance of replays if necessary. Such checks are used to identify situations where the referee is judged to have made a ‘clear and obvious error’ or there has been a ‘serious missed incident’. Mistakes for other types of incidents – e.g. the possible award of a free kick – or mistakes that are not judged to be obvious errors should be discarded during the check process.

When a check by VAR does reveal a possible mistake of the above type, the referee is notified and is then at liberty to carry out a review of the incident. The review can consist solely of a description of the event from the VAR to the referee, or it can comprise a video review of the incident by the referee using a screen at the side of the pitch. The referee is not obliged to undertake a review of an incident, even if flagged by the VAR following a check. On the other hand, the referee may choose to carry out a review of an incident, even if it has not been flagged by the VAR.

Hope that’s all clear.

Anyway, the IFAB report analysed more than 800 competitive games in which VAR was used, and includes the following statistics:

  • 56.9% of checks were for penalties and goals; almost all of the others were for red card incidents;
  • On average there were fewer than 5 checks per match;
  • The median check time of the VAR was 20 seconds
  • The accuracy of reviewable decisions before VAR was applied was 93%.
  • 68.8% of matches had no review
  • On average, there is one clear and obvious error every 3 matches
  • The decision accuracy after VAR is applied is 98.9%.
  • The median duration of a review is 60 seconds
  • The average playing time lost due to VAR is less than 1% of the total playing time.
  • In 24% of matches, VAR led to a change in a referee’s decision; in 8% of matches this change led to a decisive change in the match outcome.
  • A clear and obvious error was not corrected by VAR in around 5% of matches.

This all seems very impressive. A great use of Statistics to check the implementation of the process and to validate its ongoing use. And maybe that’s the right conclusion. Maybe. It’s just that, as a statistician, I’m still left with a lot of questions. Including:

  1. What was the process for checking events, both before and after VAR? Who decided if a decision, either with or without VAR, was correct or not?
  2. It would be fairest if the analysis of incidents in this experiment were done ‘blind’. That’s to say, when an event is reviewed, the analyst should be unaware of what the eventual decision of the referee was. This would avoid the possibility of the experimenter – perhaps unintentionally – being drawn towards incorrect agreement with the VAR process decision.
  3. It’s obviously the case when watching football, that even with the benefit of slow-motion replays, many decisions are marginal. They could genuinely go either way, without being regarded as wrong decisions. As such, the impressive-looking 93% and 98.9% correct decision rates are probably more fairly described as rates of not incorrect decisions.
  4. There’s the possibility that incidents are missed by the referee, missed by VAR and missed by whoever is doing this analysis. As such, there’s a category of errors that are completely ignored here.
  5. Similarly, maybe there’s an average of only 5 checks per match because many relevant incidents are being missed by VAR.
  6. The use of the median to give average check and review times could be disguising the fact that some of these controls take a very long time indeed. It would be a very safe bet that the mean times are much bigger than the medians, and would give a somewhat different picture of the extent to which the process interrupts games when applied.
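The point about medians and means can be illustrated with a small simulation. This is only a sketch under loud assumptions: the log-normal shape, the spread parameter `sigma` and the function name `simulate_review_times` are all invented for illustration, and only the 60-second median comes from the report. None of the simulated numbers are IFAB data.

```python
import math
import random
import statistics


def simulate_review_times(n, median_seconds=60.0, sigma=1.0, seed=0):
    """Draw n hypothetical review durations from a right-skewed distribution.

    A log-normal distribution with location log(median_seconds) has median
    equal to median_seconds. The shape and sigma are assumptions, not
    anything reported by IFAB.
    """
    rng = random.Random(seed)
    mu = math.log(median_seconds)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


times = simulate_review_times(10_000)
print("median:", round(statistics.median(times), 1))
print("mean:  ", round(statistics.mean(times), 1))
print("max:   ", round(max(times), 1))
```

With skewed durations like these, a handful of very long reviews pull the mean well above the median, which is exactly the information a median-only summary hides.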

So, I remain sceptical. The headline statistics are encouraging, but there are aspects about the design of this experiment and the presentation of results that I find questionable. And that’s before we assess value in terms of cost and impact on the flow of games.

On the other hand, there’s at least some evidence that VAR is having incidental effects that aren’t picked up by the above experiment. It was reported that in Italy’s Serie A, the number of red cards given for dissent during the first season of VAR was one, compared with eleven in the previous season. The implication is that VAR is not just correcting mistakes, but also leading players to moderate their behaviour on the pitch. Not that this improvement is being universally adopted by all players in all leagues of course. But anyway, the fact that VAR might actually be improving the game in terms of the way it’s played, above and beyond any potential improvements to the refereeing process, is an interesting aspect, potentially in VAR’s favour, which falls completely outside the scope of the IFAB study discussed above.

But in terms of VAR’s impact on refereeing decisions, I can’t help feeling that the IFAB study was designed, executed and presented in a way that shines the best possible light on VAR’s performance.

Incidentally, if you’re puzzled by the title of this post, you need to open the link I gave above, and exercise your fluency in Spanish vernacular.

Picture this

You can’t help but be amazed at the recent release of the first ever genuine image of a black hole. The picture itself, and the knowledge of what it represents, are extraordinary enough, but the sheer feat of human endeavour that led to this image is equally breathtaking.

Now, as far as I can see from the list of collaborators that are credited with the image, actual designated statisticians didn’t really contribute. But, from what I’ve read about the process of the image’s creation, Statistics is central to the underlying methodology. I don’t understand the details, but the outline is something like this…

Although black holes are extremely big, they’re also a long way away. This one, for example, has a diameter that’s bigger than our entire solar system. But it’s also at the heart of the Messier 87 galaxy, some 55 million light years away from Earth. Which means that when looking towards it from Earth, it occupies a very small part of space. The analogy that’s been given is that capturing the black hole’s image in space would be equivalent to trying to photograph a piece of fruit on the surface of the moon. And the laws of optics imply this would require a telescope the size of our whole planet.

To get round this limitation, the Event Horizon Telescope (EHT) program uses simultaneous signals collected from a network of eight powerful telescopes stationed around the Earth. However, the result, naturally, is a sparse grid of signals rather than a complete image. The rotation of the earth means that with repeat measurements this grid gets filled-out a little. But still, there’s a lot of blank space that needs to be filled-in to complete the image. So, how is that done?

In principle, the idea is simple enough. This video was made some years ago by Katie Bouman, who’s now got worldwide fame for leading the EHT program to produce the black hole image:

The point of the video is that to recognise the song, you don’t need the whole keyboard to be functioning. You just need a few of the keys to be working – and they don’t even have to be 100% precise – to be able to identify the whole song. I have to admit that the efficacy of this video was offset for me by the fact that I got the song wrong, but in the YouTube description of the video, Katie explains this is a common mistake, and uses the point to illustrate that with insufficient data you might get the wrong answer. (I got the wrong answer with complete data though!)

In the case of the music video, it’s our brain that fills in the gaps to give us the whole tune. In the case of the black hole data, it’s sophisticated and clever picture imaging techniques, that rely on the known physics of light transmission and a library of the patterns found in images of many different types. From this combination of physics and library of image templates, it’s possible to extrapolate from the observed data to build proposal images, and for each one find a score of how plausible that image is. The final image is then the one that has the greatest plausibility score. Engineers call this image reconstruction; but the algorithm is fundamentally statistical.
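To make the 'propose images, score them, keep the most plausible' idea concrete, here is a toy sketch. It is emphatically not the EHT algorithm: the three one-dimensional 'images', the observed values, the Gaussian noise model and the names `plausibility` and `reconstruct` are all assumptions made up for illustration.

```python
# Hypothetical 1-D "images" standing in for a library of templates.
CANDIDATES = {
    "flat": [1, 1, 1, 1, 1, 1, 1, 1],
    "ramp": [0, 1, 2, 3, 4, 5, 6, 7],
    "bump": [0, 1, 3, 5, 5, 3, 1, 0],
}

# Sparse, noisy measurements (made up): pixel index -> observed value.
# Only 3 of the 8 pixels are observed, like a sparse telescope grid.
OBSERVED = {1: 1.2, 4: 4.8, 6: 0.9}


def plausibility(image, observed, log_prior=0.0, noise_sd=1.0):
    """Score = log prior + Gaussian log-likelihood (up to a constant)."""
    misfit = sum((image[i] - value) ** 2 for i, value in observed.items())
    return log_prior - misfit / (2 * noise_sd ** 2)


def reconstruct(candidates, observed, log_priors=None):
    """Return the name of the candidate with the highest plausibility."""
    log_priors = log_priors or {}
    return max(
        candidates,
        key=lambda name: plausibility(
            candidates[name], observed, log_priors.get(name, 0.0)
        ),
    )


print(reconstruct(CANDIDATES, OBSERVED))  # bump
```

The unobserved pixels are filled in by whichever template best matches the few pixels that were measured, weighted by how plausible that template was to begin with.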

At least, that’s how I understood things. But here’s Katie again giving a much better explanation in a TED talk:

Ok, so much for black holes. Now, think of:

  1. Telescopes as football matches;
  2. Image data as match results;
  3. The black hole as a picture that contains information about how good football teams really are;
  4. Astrophysics as the rules by which football matches are played;
  5. The templates that describe how an image changes from one pixel to the next as a rule for saying how team performances might change from one game to the next.

And you can maybe see that in a very general sense, the problem of reconstructing an image of a black hole has the same elements as that of estimating the abilities of football teams. Admittedly, our football models are rather less sophisticated, and we don’t need to wait for the end of the Antarctic winter to ship half a tonne of hard drives containing data back to the lab for processing. But the principles of Statistics are generally the same in all applications, from black hole imaging to sports modelling, and everything in between.

Olé, Olé, Olé


So, everyone agrees that Ole Solskjær has been a breath of fresh air at Man United and is largely responsible for their remarkable turnaround this season. But here’s a great article by the guys at StatsBomb that adds perspective to that view. Sure, there’s been a change in results since Solskjær arrived, but more importantly xG – the expected goals – have also improved considerably, both in terms of attack and defence. This suggests that the results are not just due to luck; United are genuinely creating more chances and preventing those for the opposition at a greater rate than under Mourinho.

Nonetheless, United’s performance in terms of actual goals is out-performing that of xG: at the time of the StatsBomb report, total xG for United over all games under Solskjær was 17.72, whereas actual goals were 25; and total xG against United was 10.99, with actual goals at 8. In other words, they’ve scored more, and conceded fewer, goals than their performance merits. This suggests that, notwithstanding the improvement in performance, United have also benefited from an upsurge in luck, both in attack and defence.
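The arithmetic behind that comparison is simple enough to spell out. The figures below are the ones quoted from the StatsBomb report; the variable names are just illustrative.

```python
# Totals under Solskjaer at the time of the StatsBomb report.
xg_for, goals_for = 17.72, 25
xg_against, goals_against = 10.99, 8

# Goals scored beyond what the chances created would typically yield.
attack_surplus = goals_for - xg_for
# Goals "saved" relative to what the chances conceded would typically cost.
defence_surplus = xg_against - goals_against

print(f"Scored {attack_surplus:.2f} more goals than xG")
print(f"Conceded {defence_surplus:.2f} fewer goals than xG against")
```

So the gap between results and underlying performance is roughly seven goals in attack and three in defence.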

But more generally, what is the value of a good manager? This recent article references a statistical analysis of data from the German Bundesliga, which aimed to quantify the potential effect a manager could have on a team. It’s not a completely straightforward issue, since the best managers tend to go to the best clubs, who are bound to have a built-in tendency for success that’s not attributable to the manager. Therefore, the research attempted to distinguish between team and manager effects. Their conclusions were:

  • The best 20% of managers were worth around 0.3 points per game more than the weakest 20% of managers. This amounts to 10.2 points over a 34-game season in the Bundesliga.
  • A manager’s estimated performance proved to be a valuable predictor in team performance when a manager changed clubs.
  • The best and worst managers have a strong impact on team performance. For teams with managers having closer to average ability, team performance is more heavily impacted by other factors, such as player quality and recruitment strategy.

In summary, on the basis of this research, there is value in aiming for the best of managers, and avoiding the worst, but not much evidence to suggest it’s worth shopping around in the middle. There are some caveats to this analysis though, and in particular about the way it’s described in the Guardian article:

  1. The analysis uses data from German leagues only up to 2013-14.
  2. This amounts to a total of just 6,426 matches, and includes relatively few managers.
  3. The Guardian article states ‘budget per season’ was accounted for. It wasn’t.
  4. The Guardian article refers to ‘statistical wizardry’. This consists of simple linear regression on points per half season with separate effects for managers and teams. This might be a sensible strategy, but it’s not wizardry.

So, it’s best to treat the precise conclusions of this report with some caution. Nonetheless, the broad picture it paints is entirely plausible.

And going back to Solskjær: there are good reasons to believe he is partly responsible for the overall improvement in performance at United, but a comparison between goals and xG suggests that the team have also been a bit on the lucky side since his arrival, and that their results have flattered to deceive a little.

Heads dropping


Here’s a fun probability problem that Benoit.Jottreau@smartodds.co.uk showed me. If you’re clever at probability, you might be able to solve it exactly; otherwise it’s easy to simulate. But as with previous problems of this type, I think it’s more interesting to find out what you would guess the answer to be, without thinking about it too deeply.

So, suppose you’ve got 10 coins. They’re fair coins, in the sense that if you toss any of them, they’re equally likely to come up heads or tails. You toss all 10 coins. You then remove the ones that come up heads. The remaining ones – the ones that come up tails – you toss again in a second round. Again, you remove any that come up heads, and toss again the ones that come up tails in a third round. And so on. In each round, you remove the coins that come up heads, and toss again the coins that come up tails. You stop once all of the coins have been removed.

The question: on average, how many rounds of this game do you need before all of the coins have been removed?

There are different mathematical ways of approaching this problem, but I’m not really interested in those. I’m interested in how good we are, collectively, at using our instincts to guess the solution to a problem of this type. So, I’d really appreciate it if you’d send me your best guess.

Actually, let’s make it a little more interesting. Can you send me an answer to a second question as well?

Second question: same game as above, but starting with 100 coins. This time, on average, how many rounds do you need before all of the coins have been removed?

Please send your answers to me directly or via this survey form.

I’ll discuss the answers you (hopefully) send me, and the problems themselves in more detail, in a subsequent post.

Please don’t fill out the survey if you solved the problem either mathematically or by simulation, though if you’d like to send me your solutions in either of those cases, I’d be very happy to look at them and discuss them with you.
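For anyone who would rather simulate than guess, a minimal sketch in Python might look like the following. The function names and the number of simulated games are arbitrary choices, and the answers are deliberately not printed here so as not to spoil the survey.

```python
import random


def rounds_until_empty(n_coins, rng):
    """Play one game: toss the remaining coins, remove heads, repeat."""
    rounds = 0
    remaining = n_coins
    while remaining > 0:
        rounds += 1
        # Each remaining coin independently lands tails with probability 1/2
        # and so survives into the next round.
        remaining = sum(1 for _ in range(remaining) if rng.random() < 0.5)
    return rounds


def average_rounds(n_coins, n_games=10_000, seed=0):
    """Estimate the average number of rounds over many simulated games."""
    rng = random.Random(seed)
    total = sum(rounds_until_empty(n_coins, rng) for _ in range(n_games))
    return total / n_games


if __name__ == "__main__":
    for n in (10, 100):
        print(n, "coins:", average_rounds(n))
```

Because the result is a simulation average, it changes slightly with the seed and gets more stable as `n_games` increases.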


Calling BS

You have to be wary of newspaper articles published on 1 April, but I think this one is genuine. The Guardian on Monday contained a report about scientific research into bullshit. Or more specifically, a scientific/statistical study into the demographics of bullshitting.

Now, to make any sense of this, it’s important first to understand what bullshit is. Bullshit is different from lying. The standard treatise in this field is ‘On Bullshit’ by Harry Frankfurt. I’m not kidding. He writes:

It is impossible for someone to lie unless he thinks he knows the truth. Producing bullshit requires no such conviction

In other words, bullshitting is providing a version of events that gives the impression you know what you are talking about, when in fact you don’t.

Unfortunately, standard dictionaries tend to define bullshitting as something like ‘talking nonsense’, though this is – irony alert – bullshit. This article explains why and includes the following example. Consider the phrase

Hidden meaning transforms unparalleled abstract beauty.

It argues that since the sentence is grammatically correct, but intellectually meaningless, it is an example of bullshit. On the other hand, the same set of words in a different order, for example

Unparalleled transforms meaning beauty hidden abstract.

are simply nonsense. Since they lack grammatical structure, the author isn’t bullshitting. He’s just talking garbage.

So, bullshit is different from lying in that the bullshitter will generally not know the truth; and it’s different from nonsense in that it has specific intent to deceive or misdirect.

But back to the Guardian article. The statistical study it refers to reveals a number of interesting outcomes:

  • Boys bullshit more than girls;
  • Children from higher socioeconomic backgrounds tend to bullshit more than those from poorer backgrounds;
  • North Americans bullshit the most (among the countries studied);
  • Bullshitters tend to perceive themselves as self-confident and high achievers.

If only I could think of an example of a self-confident, North American male from a wealthy background with a strong tendency to disseminate bullshit in order to illustrate these points.

But what’s all this got to do with Statistics? Well, it cuts both ways. First, the cool logic of Statistics can be used to identify and correct bullshit. Indeed, if you happen to study at the University of Washington, you can enrol for the course ‘Calling Bullshit: Data Reasoning in a Digital World’, which is dedicated to the subject. The objectives for this course, as listed in its syllabus, are that after the course you should be able to:

  • Remain vigilant for bullshit contaminating your information diet.
  • Recognize said bullshit whenever and wherever you encounter it.
  • Figure out for yourself precisely why a particular bit of bullshit is bullshit.
  • Provide a statistician or fellow scientist with a technical explanation of why a claim is bullshit.
  • Provide your crystals-and-homeopathy aunt or casually racist uncle with an accessible and persuasive explanation of why a claim is bullshit.

I especially like the fact that after following this course you’ll be well-equipped to take on both the renegade hippy and racist wings of your family.

So that’s the good side of things. On the bad side, it’s extremely easy to use Statistics to disseminate bullshit. Partly because not everyone is sufficiently clued-up to really understand statistical concepts and to be critical when confronted with them; and partly because, even if you have a good statistical knowledge and are appropriately sceptical, you’re still likely to have to rely on the accuracy of the analysis, without access to the data on which it was based.

For example, this article, which is an interesting read on the subject of Statistics and bullshit, discusses a widely circulated fact, attributed to the Crime Statistics Bureau of San Francisco, that:

81% of white homicide victims were killed by blacks

Except, it turns out, the Crime Statistics Bureau of San Francisco doesn’t exist and FBI figures actually suggest that 80% of white murder victims were killed by other white people. So, it’s a bullshit statement attributed to a bullshit organisation. But with social media, the dissemination of these mis-truths becomes viral, and it becomes impossible to counter them with actual facts. Indeed, the above statement was included in an image posted to Twitter by Donald Trump during his election campaign: full story here. And that tweet alone got almost 7000 retweets. So though the claim is easily disproved using reliable statistics, the message is already spread and the damage done.

So, welcome to Statistics: helping, and helping fight, bullshit.




Needles, noodles and 𝜋

A while back, on Pi Day, I sent a post celebrating the number 𝜋 and mentioned that though 𝜋 is best known for its properties in connection with the geometry of a circle, it actually crops up all over the place in mathematics, including Statistics.

Here’s one famous example…

Consider a table covered with parallel lines like in the following figure.

For argument’s sake, let’s suppose the lines are 10 cm apart. Then take a bunch of needles – or matches, or something similar – that are 5 cm in length, drop them randomly onto the table, and count how many intersect one of the lines on the table. Let’s suppose there are N needles and m of them intersect one of the lines. It turns out that N/m will be approximately 𝜋, and that the approximation is likely to improve if we repeat the experiment with a bigger value of N.

What this means in practice is that we have a statistical way of calculating 𝜋. Just do the experiment described above, and as we get through more and more needles, so the calculation of N/m is likely to lead to a better and better approximation of 𝜋.

There are various apps and so on that replicate this experiment via computer simulation, including this one, which is pretty nice. The needles which intersect any of the lines are shown in red; the others remain blue. The ratio N/m is shown in real-time, and if you’re patient enough it should get closer to the true value of 𝜋, the longer you wait. The approximation is also shown geometrically – the ratio N/m is very close to the ratio of a circle’s circumference to its diameter.

One important point though: the longer you wait, the greater will be the tendency for the approximation N/m to improve. However, because of random variation in individual samples, it’s not guaranteed to always improve. For a while, the approximation might get a little worse, before inevitably (but perhaps slowly) starting to improve again.

In actual fact, there’s no need for the needles in this experiment to be half the distance between the lines. Suppose the ratio of the needle length to the line separation is r (with the needles no longer than the gap between the lines); then 𝜋 is approximated by

\hat{\pi} = \frac{2rN}{m}

In the simpler version above, r=1/2, which leads to the above result:

\hat{\pi} = \frac{N}{m}

Now, although Buffon’s needle provides a completely foolproof statistical method of calculating 𝜋, it’s a very slow procedure. You’re likely to need very many needles to calculate 𝜋 to any reasonable level of accuracy. (You’re likely to have noticed this if you looked at the app mentioned above.) And this is true of many statistical simulation procedures: the natural randomness in experimental data means that very large samples may be needed to get accurate results. Moreover, every time you repeat the experiment, you’re likely to get a different answer, at least to some level of accuracy.
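If you’d like to see the slow convergence for yourself, here is a minimal simulation sketch. It assumes r is the needle length divided by the line separation, with needles no longer than the gap, and the function name `estimate_pi` is just illustrative. One caveat: the code uses the true value of 𝜋 to draw a random angle, so it’s a demonstration of the experiment rather than an honest way of discovering 𝜋.

```python
import math
import random


def estimate_pi(n_needles, needle_len=5.0, line_gap=10.0, seed=0):
    """Estimate pi as 2 * r * N / m from simulated needle drops.

    Assumes needle_len <= line_gap, so a needle crosses at most one line.
    """
    rng = random.Random(seed)
    r = needle_len / line_gap
    crossings = 0
    for _ in range(n_needles):
        # Distance from the needle's centre to the nearest line.
        centre = rng.uniform(0.0, line_gap / 2)
        # Acute angle between the needle and the direction of the lines.
        angle = rng.uniform(0.0, math.pi / 2)
        # The needle crosses when its half-length, projected perpendicular
        # to the lines, reaches the nearest line.
        if centre <= (needle_len / 2) * math.sin(angle):
            crossings += 1
    if crossings == 0:
        raise ValueError("no crossings observed; use more needles")
    return 2 * r * n_needles / crossings


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, estimate_pi(n))
```

Changing the seed gives a different answer each time, and even a million needles will typically pin down only the first couple of decimal places.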

Anyway… Buffon’s needle takes its name from Georges-Louis Leclerc, Comte de Buffon, a French mathematician in the 18th century who first posed the question of what the probability would be for a needle thrown at random to intersect a line. And Buffon’s needle is a pretty well-known problem in probability and Statistics.

Less well-known, and even more remarkable, is Buffon’s noodle problem. Suppose the needles in Buffon’s needle problem are allowed to be curved. So rather than needles, they are noodles(!) We drop N noodles – of possibly different shapes, but still 5 cm in length – onto the table, and count the total number of times the noodles cross a line on the table. Because of the curvature of the noodles, it’s now possible that a single noodle crosses a line more than once, so m is now the total number of line crossings, where the contribution from any one noodle might be 2 or more. Remarkably, it turns out that despite the curvature of the noodles and despite the fact that individual noodles might have multiple line crossings, the ratio N/m still provides an approximation to 𝜋 in exactly the same way it did for the needles.

This result for Buffon’s noodle follows directly from that of Buffon’s needle. You might like to try to think about why that is so. If not, you can find an explanation here.

Finally, a while back, I sent a post about Mendelian genetics. In it I discussed how Mendel used a statistical analysis of pea experiments to develop his theory of genetic inheritance. I pointed out, though, that while the theory is undoubtedly correct, Mendel’s statistical results were almost certainly too good to be true. In other words, he’d fixed his results to get the experimental results which supported his theory. Well, there’s a similar story connected to Buffon’s needle.

In 1901, an Italian mathematician, Mario Lazzarini, carried out Buffon’s needle experiment with a ratio of r=5/6. This seems like a strangely arbitrary choice. But as explained in Wikipedia, it’s a choice which enables the approximation of 355/113, which is well-known to be an extremely accurate fractional approximation for 𝜋. What’s required to get this result is that in a multiple of 213 needle throws, the same multiple of 113 needles intersect a line. In other words, 113 intersections when throwing 213 needles. Or 226 when throwing 426. And so on.

So, one explanation for Lazzarini’s remarkably accurate result is that he simply kept repeating the experiment in multiples of 213 throws until he got the answer he wanted, and then stopped. Indeed, he reported a value of N=3408, which happens to be 16 times 213. And in those 3408 throws, he reportedly got 1808 line intersections, which happens to be 16 times 113.
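Lazzarini’s numbers are easy to check with exact fractions. This sketch simply plugs his reported values into the formula above; the function name `buffon_estimate` is made up for illustration.

```python
import math
from fractions import Fraction


def buffon_estimate(r, n_throws, n_crossings):
    """Exact value of the estimate 2 * r * N / m."""
    return 2 * r * Fraction(n_throws, n_crossings)


lazzarini = buffon_estimate(Fraction(5, 6), 3408, 1808)
print(lazzarini)  # 355/113
print(abs(float(lazzarini) - math.pi) < 1e-6)  # True
```

Just one more or one fewer crossing in those 3408 throws would have spoiled the agreement from the fourth decimal place onwards, which is part of why the result looks suspicious.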

An alternative explanation is that Lazzarini didn’t do the experiment at all, but pretended he did with the numbers chosen as above so as to force the result to be the value that he actually wanted it to be. I know that doesn’t seem like a very Italian kind of thing to do, but there is some circumstantial evidence that supports this possibility. First, as also explained in Wikipedia:

A statistical analysis of intermediate results he reported for fewer tosses leads to a very low probability of achieving such close agreement to the expected value all through the experiment.

Second, Lazzarini reportedly described a physical machine that he used to carry out the experimental needle throwing. However, a basic study of the design of this machine shows it to be impossible from an engineering point of view.

So, like Mendel, it’s rather likely that Lazzarini invented some data from a statistical experiment just to get the answer that he was hoping to achieve. And the moral of the story? If you’re going to make evidence up to ‘prove’ your answer, build a little bit of statistical error into the answer itself, otherwise you might find statisticians in 100 years’ time proving (beyond reasonable doubt) you cheated.

Britain’s Favourite Crisps


As I’ve mentioned before, my aim in this blog is to raise awareness and understanding of statistical concepts and procedures, particularly with regard to potential applications in sports modelling. Often this will involve discussing particular techniques and methodologies. But sometimes it might involve simply referencing the way statistics has been used to address some particular important topic of the day.

With this latter point in mind, Channel 5 recently showed a program titled ‘Britain’s Favourite Crisps’ in which they revealed the results of a survey investigating, well, Britain’s favourite crisps. Now, if your cultural roots are not based in the UK, the complexities of crisp preference might seem as strange as the current wrangling over Brexit. But those of you who grew up in the UK are likely to be aware of the sensitivities of this issue. Crisp preferences, that is. Let’s not get started on Brexit.

A summary of the results of the survey is contained in the following diagram:

And a complete ranking of the top 20 is included here.

As you might expect for such a contentious issue, the programme generated a lot of controversy. For example:

And so on.

Personally, I’m mildly upset – I won’t say outraged exactly – at Monster Munch appearing only in the Mid-Tier. But let me try to put my own biases aside and turn to some statistical issues. These results are based on some kind of statistical survey, but this raises a number of questions. For example:

  1. How many people were included in the survey?
  2. How were they interviewed? Telephone? Internet? Person-to-person?
  3. How were they selected? Completely randomly? Or balanced to reflect certain demographics? Something else?
  4. What were they asked? Just their favourite? Or a ranking of their top 20 say?
  5. Were participants given a list of crisps to choose from, or were they given complete freedom of choice?
  6. Is it fair to consider Walkers or Pringles as single categories, when they cover many different flavours, while other crisps, such as Quavers, have just a single variety?
  7. How were results calculated? Just straight averages based on sample results, or weighted to correct demographic imbalances in the survey sample?
  8. How was the issue of non-respondents handled?
  9. How certain can we be that the presented results are representative of the wider population?
  10. Is a triangle appropriate for representing the results? It suggests the items in each row are equivalent. Was that intended? If so, is it justified by the results?

It may be that some of these questions are answered in the programme itself. Unfortunately, living outside the UK, I can’t access the programme, but those of you based in the UK can, at least for some time, here. So, if you are able to watch it and get answers to any of the questions, please post them in the comments section. But my guess is that most of the questions will remain unanswered.

So, what’s the point? Well, statistical analyses of any type require careful design and analysis. Decisions have to be made about the design and execution of an experiment, and these are likely to influence the eventual results. Consequently, the analysis itself should also take into account the way the experiment was designed, and attempt to correct for potential imbalances. Moreover, a proper understanding of the results of a statistical analysis requires detailed knowledge of all aspects of the analysis, from design to analysis.

And the message is, never take results of a statistical analysis on trust. Ask questions. Query the design. Ask where the data came from. Check the methodology. Challenge the results. Ask about accuracy. Question whether the results have been presented fairly.

Moreover, remember that Statistics is as much an art as a science. Both the choice of design of an experiment and the randomness in data mean that a different person carrying out the same analysis is likely to get different results.

And all of this is as true for sports modelling as it is for the ranking of Britain’s favourite crisps.

The Datasaurus Dataset

Look at the data in this table. There are 2 rows of data labelled g1 and g2. I won’t, for the moment, tell you where the data come from, except that the data are in pairs. So, each column of the table represents a pair of observations: (2, 1) is the first pair, (3, 5) is the second pair and so on. Just looking at the data, what would you conclude?

Scroll down once you’ve thought about this question.


Maybe you’re better at this stuff than me, but I wouldn’t find this an easy question to answer. Even though there are just 10 observations, and each observation contains just a pair of values, I find it difficult to simply look at the numbers and see any kind of pattern at all, either in the individual rows of numbers, or in any possible relationship between the two. And if it’s difficult in this situation, it’s bound to be much more difficult when there might be many thousands or millions of observations, and each observation might not be just a pair, but several – perhaps many – numbers.

So, not easy. But it’s a standard statistical requirement: taking a set of observations – in this case pairs – and trying to understand what they might convey about the process they come from. It’s really the beating heart of Statistics: trying to understand structure from data. Yet even with just 10 pairs of observations, the task isn’t straightforward.

To deal with this problem, an important aspect of statistical analysis is the summarisation of data – reducing the information they contain to just a few salient features. Specifically, in this case, reducing the information that’s contained in the 10 pairs of observations to a smaller set of numbers – so-called statistics – that summarise the most relevant aspects of the information that the data contain. The most commonly-used statistics, as you probably know, are:

  1. The means: the average values of each of the g1 and g2 sets of values.
  2. The standard deviations: measures of spread around the means of each of the g1 and g2 sets of values.
  3. The correlation: a measure, on a scale of -1 to 1, of the tendency for the g1 and g2 values to be related to each other.

The mean is well-known. The standard deviation is a measure of how spread out a set of values is: the more dispersed the numbers, the greater the standard deviation. Correlation is maybe less well understood, but provides a measure of the extent to which two sets of values are linked to one another (albeit in a linear sense).
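As a rough sketch of how these three statistics are calculated, here is some plain Python. The function names and the paired scores are invented purely for illustration (they are not the data in the table), and I've assumed the sample version of the standard deviation, which divides by n - 1.

```python
import math

def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)

def std_dev(values):
    """Sample standard deviation (assumption: divide by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Pearson correlation between two equally long lists of paired values."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired values, made up for this example only
g1 = [1, 2, 0, 3, 1, 2]
g2 = [0, 2, 1, 1, 0, 2]

print("means:", mean(g1), mean(g2))
print("standard deviations:", std_dev(g1), std_dev(g2))
print("correlation:", correlation(g1, g2))
```

The correlation divides the joint variation of the pairs by the product of their individual variations, which is why it always lands between -1 and 1.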

So, rather than trying to identify patterns in a set of 10 pairs of numbers, we reduce the data to their main features:

  • g1 mean = 2.4; g2 mean = 1.8
  • g1 standard deviation = 0.97; g2 standard deviation = 1.48
  • (g1,g2) correlation = 0.22

And from this we can start to build a picture of what the data tell us:

  1. The average value of g1 is rather greater – actually 0.6 greater – than the mean of g2, so there is a tendency for the g1 component of a pair to be bigger than the g2 component.
  2. The g2 values are more spread out than the g1 values.
  3. The positive value of correlation, albeit a value substantially lower than the maximum of 1, suggests that there is a tendency for the g1 and g2 components to be associated: bigger values of g1 tend to imply bigger values of g2.

So now let me tell you what the data are: they are the home and away scores, g1 and g2 respectively, in the latest round of games – matchday 28 – in Serie A. So, actually, the summary values make quite good sense: the mean of g1 is greater than the mean of g2, which is consistent with a home advantage effect. And it’s generally accepted that home and away scores tend to be positively correlated. It’s maybe a little surprising that the standard deviation of away goals is greater than that of home goals, but with just 10 games this is very likely just to be a chance occurrence.

Which gives rise to a different issue: we’re unlikely to be interested in the patterns contained in the data from these particular 10 games. It’s much more likely we’re interested in what they might tell us about the pattern of results in a wider set of games –  perhaps Serie A games from any arbitrary matchday.

But that’s a story for another post sometime. The point of this post is that we’re simply not programmed to look at large (or even quite small) datasets and be able to see any patterns or messages they might contain.  Rather, we have to summarise data with just a few meaningful statistics in order to understand and compare them.

But all of the above is just a precursor to what I actually wanted to say in this post. Luigi.Colombo@smartodds.co.uk recently forwarded the following Twitter post to the quant team on RocketChat. Press the start arrow to set off the animation.

As explained in the message, every single one of the images in this animation – including the passages from one of the main images to another – has exactly the same summary statistics. That’s to say, the mean and standard deviation of both the x- and y-values stay the same, as does the correlation between the two sets of values.

So what’s the moral here? Well, as we saw above, reduction of data to simple summary statistics is immensely helpful in getting a basic picture of the structure of data. But: it is a reduction nonetheless, and something is lost. All of the datasets in the Twitter animation have identical summary statistics, yet the data themselves are dramatically different from one image to another.

So, yes, follow my advice above and use summary statistics to understand data better. But be aware that a summary of data is just that, a summary, and infinitely many other datasets will have exactly the same summary statistics. If it’s important to you that your data look more like concentric ellipses than a dinosaur, you’d better not rely on means and standard deviations to tell you so.
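A toy version of the same phenomenon can be checked in a few lines of Python. This is not the Datasaurus data: the two four-point shapes below (a square and a diamond) are my own invented example, and the summary function is just a sketch that uses the sample standard deviation.

```python
import math

def summary(points):
    """Means, sample standard deviations and correlation of (x, y) points."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return {
        "mean_x": mx,
        "mean_y": my,
        "sd_x": math.sqrt(sxx / (n - 1)),
        "sd_y": math.sqrt(syy / (n - 1)),
        "corr": sxy / math.sqrt(sxx * syy),
    }

# Two hypothetical datasets with visibly different layouts
r = math.sqrt(2)
square = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
diamond = [(r, 0), (-r, 0), (0, r), (0, -r)]

print(summary(square))
print(summary(diamond))
```

Up to floating-point rounding, the two summaries agree on all five statistics, even though not a single point is shared between the two shapes.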

Altered images

In a recent post I described the following problem which I encountered while sitting in a dentist waiting room:

Images are randomly selected from a library of images and shown on a screen. After watching the screen for a while, I notice one or more of the images is a repeat showing of an earlier image. How can I use information on the number of images observed and the number of repeats to estimate how many images there are in the entire library?

I had two great replies suggesting solutions to this problem. The first was from Nity.Raj@smartodds.co.uk

Surely the efficient thing to do is to hack the database of images so you just find out how many there are in fact, rather than estimating?

It’s the perfect answer, but I just need to run it past someone with a legal background who’s connected to Smartodds to check it’s compliant with relevant internet communication laws. Can anyone suggest somebody suitable?

The other idea was from Ian.Rutherford@smartbapps.co.uk who suggested this:

I would take the total of all the images seen and divide it by the number of times I spotted the 23 to Leigh Park to give an estimation of the number of different images

You’ll have to read the original post to understand the ’23 to Leigh Park’ bit of this answer, but you can take it as a reference to any one of the images that you’ve seen. So, let’s suppose I’ve seen 100 images, and I’ve seen one particular image that I’m interested in 4 times. Then Ian’s suggestion is to estimate the total number of images as

\hat{N} = 100/4 = 25

Ian didn’t explain his answer, so I hope I’m not doing him a disservice, but I think the reasoning for this solution is as follows. Suppose the population size is N and I observe v images. Then since the images occur at random, the probability I will see any particular image when a random image is shown is 1/N. So the average, or expected, number of times I will see a particular image in a sequence of v images is v/N. If I end up seeing the image t times, this means I should estimate v/N with t. But rearranging this, it means I estimate N with v/t.
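The reasoning above can be sketched in Python as follows. The function names are my own, and the second function assumes the images shown are recorded as a simple list of labels, which is just one convenient way of representing them.

```python
def single_image_estimate(v, t):
    """Estimate the library size N as v / t.

    v: total number of images shown
    t: number of times one chosen image was seen (must be at least 1)
    """
    if t <= 0:
        raise ValueError("the chosen image must have been seen at least once")
    return v / t

def estimate_from_sequence(sequence, target):
    """Apply the same estimate to a recorded sequence of image labels."""
    return single_image_estimate(len(sequence), sequence.count(target))

# The example from the text: 100 images, one particular image seen 4 times
print(single_image_estimate(100, 4))
```

Note that the answer depends on which image is chosen as the target: picking a different image from the same sequence could give a different estimate.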

It’s a really smart answer, but I think there are two slight drawbacks.

  1. Suppose, in the sequence of 100 images, I’d already seen 26 (or more) different images. In that case I’d know the estimate of 25 was bound to be an under-estimate.
  2. This estimate uses information based on the number of repeats of just one image. Clearly, the number of repeats of each of the different images I observe is equally relevant, and it must be wasteful not to use the information they contain as well.

That said, the simplicity and logic of the answer are both extremely appealing.

But before receiving these answers, and actually while waiting at the dentist, I had my own idea. I’m not sure it’s better than Nity’s or Ian’s, and it has its own drawbacks. But it tells a nice story of how methods from one area of Statistics can be relevant for something apparently unrelated.

So, imagine you’re an ecologist and there’s concern that pollution levels have led to a reduction in the number of fish in a lake. To assess this possibility you need to get an estimate of how many fish there are in the lake.  The lake is large and deep, so surface observations are not useful. And you don’t have equipment to make sub-surface measurements.

What are you going to do?

Have a think about this before scrolling down.


One standard statistical approach to this problem is a technique called mark and recapture. There are many variations on this method, some quite sophisticated, but we’ll discuss just the simplest, which works as follows.

A number of fish are caught (unharmed), marked and released back into the lake. Let this number of fish be n, say.

Some time later, a second sample of fish – let’s say of size K – is taken from the lake. We observe that k fish of this second sample have the mark that we applied in the first sample. So k/K is the proportion of fish in the second sample that have been marked. But since this is just a random sample from the lake, we’d expect this proportion to be similar to the proportion of marked fish in the entire lake, which will be n/N, where N is the unknown total number of fish in the lake.

Expressing this mathematically, we have an approximation

k/K \approx n/N

But we can rearrange this to get:

N \approx nK/k

In other words, we could use

\hat{N}= nK/k

as an estimate for the number of fish, since we’d expect this to be a reasonable approximation to the actual number N.

So, let’s suppose I originally caught, marked and released 100 fish. I subsequently catch a further 50 fish, of which 5 are marked. Then, n=100, K=50, k=5 and so

\hat{N} =  nK/k = 100 \times 50 /5 =1000

and I’d estimate that the lake contains 1000 fish.
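Here is a minimal Python sketch of the calculation. This simplest version of mark and recapture is usually known as the Lincoln-Petersen estimator; the function names are my own, and the small simulation helper assumes every fish is equally likely to be caught in both samples.

```python
import random

def mark_recapture_estimate(n, K, k):
    """Estimate population size N as n * K / k.

    n: number of fish marked in the first sample
    K: size of the second sample
    k: number of marked fish found in the second sample
    """
    if k == 0:
        raise ValueError("no marked fish recaptured, so the estimate is undefined")
    return n * K / k

def simulate_recaptures(N, n, K, rng):
    """Simulate k for a lake of N fish (assumes all fish are equally catchable)."""
    marked = set(rng.sample(range(N), n))   # first sample: mark n fish
    second = rng.sample(range(N), K)        # second sample of K distinct fish
    return sum(1 for fish in second if fish in marked)

# Worked example from the text
print(mark_recapture_estimate(100, 50, 5))

# One simulated second sample from a hypothetical lake of 1000 fish
k = simulate_recaptures(1000, 100, 50, random.Random())
print("marked fish recaptured:", k)
```

Running the simulation a few times shows how much k, and therefore the estimate, can vary from one second sample to the next. It can even be zero, in which case the formula breaks down.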

Now, maybe you can see where this is going. Suppose instead of a lake of fish, we have a library of images. This method would allow me to estimate the size of the population of images, just as it does a population of fish. But there’s a slight catch (if you’ll pardon the pun). When I take a sample of fish from a lake, each of the fish in the sample is unique. But when I look at a selection of images at the dentist, some of them may be repeats. So I can’t quite treat my sample of images in exactly the same way as I would a sample of fish. To get round this problem I have to ignore the repeated images within each sample. So, my strategy is this:

  1. Observe a number of the images, ignoring any repeats. Call the number of unique images n.
  2. Observe a second set of images, again ignoring any repeats within the set. Let the number of unique images in this set be K. But this time, keep count of how many of them also appeared in the first set. Let’s say this number of repeats with the first sample is k.

The estimate of the population size – for the same reasons as estimating fish population sizes – is then

\hat{N} =  nK/k.

So, suppose I chose to look at images for 10 minutes. In that period there were 85 images, but 5 of these were repeats. So, n=80. I then watch for another 5 minutes and observe 30 unique images, 4 of which were also observed in the first sample. So, n=80, K=30, k=4 and my estimate of the number of images in the database is

\hat{N} =  nK/k = 80 \times 30 /4 =600
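The two-step strategy can be sketched in Python, assuming each viewing session is recorded as a list of image labels. The function name and the made-up labels are mine, chosen only so that the counts match the example above.

```python
def library_size_estimate(first_sample, second_sample):
    """Mark-and-recapture style estimate of the number of images in a library.

    Repeats within each sample are ignored by converting to sets;
    images that appear in both samples play the role of recaptured fish.
    """
    first_unique = set(first_sample)
    second_unique = set(second_sample)
    n = len(first_unique)                  # unique images in the first sample
    K = len(second_unique)                 # unique images in the second sample
    k = len(first_unique & second_unique)  # images seen in both samples
    if k == 0:
        raise ValueError("no images in common, so the estimate is undefined")
    return n * K / k

# Hypothetical labels reproducing the counts in the text:
# 85 images with 5 repeats (n=80), then 30 unique images, 4 already seen (k=4)
first = list(range(80)) + [0, 1, 2, 3, 4]
second = [0, 1, 2, 3] + list(range(100, 126))
print(library_size_estimate(first, second))
```

Using sets makes the "ignore repeats within each sample" step automatic, though it also makes plain how much information about within-sample repeats is being thrown away.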

Is this answer any better than Ian’s? I believe it uses more information available in the data, since it doesn’t focus on just one image. It’s also less likely to give an answer that is inconsistent with the data that I’ve already seen. But it does have drawbacks and limitations:

  1. Ignoring the information on repeats within each sample must also be wasteful of relevant information.
  2. The distinction between the first sample and second sample is arbitrary, and it might be that different choices lead to different answers.
  3. Keeping track of repeats within and across the two samples might be difficult in practice.

In a subsequent post I’ll do a more detailed study of the performance of the two methods. In the meantime, let me summarise what I think are the main points from this discussion:

  1. Statistical problems can occur in the most surprising places.
  2. There’s usually no right or wrong way of tackling a statistical problem. One approach might be best from one point of view, while another is better from a different point of view.
  3. Statistics is a very connected subject: a technique that has been developed for one type of problem may be transferable to a completely different type of problem.
  4. Simple answers are not always the best – though sometimes they are – but simplicity is a virtue in itself.

Having said all that, there are various conventional ways of judging the performance of a statistical procedure, and I’ll use some of these to compare my solution with Ian’s in the follow-up post. Meantime, I’d still be happy to receive alternative solutions to the problem, whose performance I can also compare against mine and Ian’s.