Revel in the amazement

In an earlier post I included the following table:

As I explained, one of the columns contains the genuine land areas of each country, while the other is fake. And I asked you which is which.

The answer is that the first column is genuine and the second is fake. But without a good knowledge of geography, how could you possibly come to that conclusion?

Well, here’s a remarkable thing. Suppose we take just the leading digit of each of the values. Column 1 would give 6, 2, 2, 1,… for the first few countries, while column 2 would give 7, 9, 3, 3,… It turns out that for many naturally occurring phenomena, you’d expect the leading digit to be 1 on around 30% of occasions. So if the actual proportion is a long way from that value, then it’s likely that the data have been manufactured or manipulated.

Looking at column 1 in the table, 5 out of the 20 countries have a land area with leading digit 1; that’s 25%. In column 2, none do; that’s 0%. Even 25% is a little on the low side, but close enough to be consistent with 30% once you allow for discrepancies due to random variation in small samples. But 0% is pretty implausible. Consequently, column 1 is consistent with the 30% rule, while column 2 is not, and we’d conclude – correctly – that column 2 is faking it.

But where does this 30% rule come from? You might have reasoned that each of the digits 1 to 9 was equally likely to appear as the leading digit – a leading digit can’t be 0 – and so the percentage would be around 11% for a leading digit of 1, just as it would be for any of the other digits. Yet that reasoning turns out to be misplaced, and the true value is around 30%.

This phenomenon is a special case of something called Benford’s law, named after the physicist Frank Benford who first formalised it. (Though it had also been noted much earlier by the astronomer Simon Newcomb). Benford’s law states that for many naturally occurring datasets, the probability that the leading digit of a data item is 1 is equal to 30.1%. Actually, Benford’s law goes further than that, and gives the percentage of times you’d get a 2 or a 3 or any of the digits 1-9 as the leading digit. These percentages are shown in the following table.

Leading digit   1       2       3       4      5      6      7      8      9
Frequency       30.1%   17.6%   12.5%   9.7%   7.9%   6.7%   5.8%   5.1%   4.6%

For those of you who care about such things, these percentages are log(2/1), log(3/2), log(4/3) and so on up to log(10/9), where log here is logarithm with respect to base 10.
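If you want to check those percentages for yourself, the formula is easy to reproduce. Here’s a tiny sketch in Python (my own illustration, not part of the original post):

```python
import math

# Benford's law: P(leading digit = d) = log10((d + 1) / d), for d = 1, ..., 9
for d in range(1, 10):
    print(d, f"{100 * math.log10((d + 1) / d):.1f}%")
```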

But does Benford’s law hold up in practice? Well, not always, as I’ll discuss below. But often it does. For example, I took a dataset giving the altitudes of a large set of football stadiums around the world. I discarded a few whose altitude is below sea level, but was still left with over 13,000 records. I then extracted the leading digit of each of the altitudes (in metres) and plotted a histogram of these values. This is just a plot of the percentages of occasions each value occurred. These are the blue bars in the following diagram. I then superimposed the predicted proportions from Benford’s law. These are the black dots.

[Figure: histogram of leading digits of stadium altitudes (blue bars), with the Benford’s law percentages superimposed (black dots)]

The agreement between the observed percentages and those predicted by Benford’s law is remarkable. In particular, the observed percentage of leading digits equal to 1 is almost exactly what Benford’s law would imply. I promise I haven’t cheated with the numbers.

As further examples, there are many series of mathematically generated numbers for which Benford’s law holds exactly.

These include:

  • The Fibonacci series: 1, 1, 2, 3, 5, 8, 13, …, where each number is obtained by summing the two previous numbers in the series.
  • The integer powers of two: 1, 2, 4, 8, 16, 32, …
  • The iterative series obtained by starting with any number and successively multiplying by 3. For example, starting with 7, we get: 7, 21, 63, 189, …

In each of these infinite series, the proportion of terms with leading digit 1 is exactly 30.1% (in the limiting sense); exactly 17.6% have leading digit equal to 2, and so on.
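You can check this empirically for any of these series. Here’s a short sketch – again my own illustration – that tabulates the leading digits of the first 10,000 powers of two; the frequencies come out almost exactly at the Benford values:

```python
from collections import Counter

def leading_digit(n):
    """Leading (first) digit of a positive integer."""
    return int(str(n)[0])

# Leading digits of 2^1, 2^2, ..., 2^10000
digits = [leading_digit(2 ** k) for k in range(1, 10001)]
counts = Counter(digits)

for d in range(1, 10):
    print(d, f"{100 * counts[d] / len(digits):.1f}%")
```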

And there are many other published examples of data fitting Benford’s law (here, here, here… and so on.)

OK, at this point you should pause to revel in the amazement of this stuff. Sometimes mathematics, Statistics and probability come together to explain naturally occurring phenomena in a way that is so surprising and shockingly elegant it takes your breath away.

So, when does Benford’s law work? And why?

It turns out there are various ways of explaining Benford’s law, but none of them – at least as far as I can tell – is entirely satisfactory. All of them require a leap of faith somewhere to match the theory to real life. This view is similarly expressed in an academic article, which concludes:

… there is currently no unified approach that simultaneously explains (Benford’s law’s) appearance in dynamical systems, number theory, statistics, and real-world data.

Despite this, the various arguments used to explain Benford’s law do give some insight into why it might arise naturally in different contexts:

  1. If there is a law of this type, Benford’s law is the only one that works for all choices of scale. The decimal representation of numbers is entirely arbitrary, presumably deriving from the fact that humans, generally, have 10 fingers. But if we’d been born with 8 fingers, or had chosen to represent numbers in binary, or base 17, or something else, you’d expect a universal law to be equally valid, and not dependent on the arbitrary choice of counting system. If this is so, then it turns out that Benford’s law, adapted in the obvious way to the choice of scale, is the only one that could possibly hold. An informal argument as to why this should be so can be found here.
  2. If the logarithm of the variable under study has a distribution that is smooth and roughly symmetric – like the bell-shaped normal curve, for example – and is also reasonably well spread out, it’s easy to show that Benford’s law should hold approximately. Technically, for those of you who are interested, if X is the thing we’re measuring, and if log X has something like a normal distribution with a variance that’s not too small, then Benford’s law is a good approximation for the behaviour of X. A fairly readable development of the argument is given here, and there’s a small simulation sketch of the idea just after this list. (Incidentally, I stole the land area of countries example directly from this reference.)
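To see the second of these arguments in action, here’s the simulation sketch I mentioned (my own, with arbitrarily chosen parameters): generate values whose base-10 logarithm is normal with a reasonably large spread, and compare the leading-digit frequencies with Benford’s law.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate X such that log10(X) is normal with a decent spread.
# The mean and standard deviation here are arbitrary illustrations.
log_x = rng.normal(loc=2.0, scale=1.0, size=100_000)

# Leading digit of X: 10 to the power of the fractional part of log10(X)
# lies in [1, 10), and its integer part is the leading digit.
frac = log_x - np.floor(log_x)
leading = np.floor(10.0 ** frac).astype(int)

benford = np.log10(1 + 1 / np.arange(1, 10))
for d in range(1, 10):
    observed = (leading == d).mean()
    print(d, f"observed {100 * observed:.1f}%, Benford {100 * benford[d - 1]:.1f}%")
```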

But in the first case, there’s no explanation as to why there should be a universal law, and indeed many phenomena – both theoretical and in nature – don’t follow Benford’s law. And in the second case, except for special situations where the normal distribution has some kind of theoretical justification as an approximation, there’s no particular reason why the logarithm of the observations should behave in the required way. And yet, in very many cases – like the land area of countries or the altitude of football stadiums – the law can be shown empirically to be a very good approximation to the truth.

One thing which does emerge from these theoretical explanations is a better understanding of when Benford’s law is likely to apply and when it’s not. In particular, the argument only works when the logarithm of the variable under study is reasonably well spread out. What that means in practice is that the variable itself needs to cover several orders of magnitude: tens, hundreds, thousands etc. This works fine for something like the stadium altitudes, which vary from close to sea-level up to around 4,000 metres, but wouldn’t work for total goals in football matches, which are almost always in the range 0 to 10, for example.

So, there are different ways of theoretically justifying Benford’s law, and empirically it seems to be very accurate for different datasets which cover orders of magnitude. But does it have any practical uses? Well, yes: applications of Benford’s law have been made in many different fields, including…

Finally, there’s also a version of Benford’s law for the second digit, third digit and so on. There’s an explanation of this extension in the Wikipedia link that I gave above. It’s probably not easy to guess exactly what the law might be in these cases, but you might try and guess how the broad pattern of the law changes as you move from the first to the second and to further digits.


Thanks to those of you who wrote to me after I made the original post. I don’t think it was easy to guess what the solution was, and indeed if I were guessing myself, I think I’d have been looking for uniformity in the distribution of the digits, which turns out to be completely incorrect, at least for the leading digit. Even though I’ve now researched the answer myself, and made some sense of it, I still find it rather shocking that the law works so well for an arbitrary dataset like the stadium altitudes. Like I say: revel in the amazement.

Faking it

Take a look at the following table:

[Table: fake_data – land areas (sq km) of countries, with two columns of figures]

It shows the total land area, in square kilometres, for various countries. Actually, it’s the first part of a longer alphabetical list of all countries and includes two columns of figures, each purporting to be the corresponding area of each country. But one of these columns contains the real areas and the other one is fake. Which is which?

Clearly, if your knowledge of geography is good enough that you know the land area of Belgium – or any of the other countries in the table – or whether Bahrain is bigger than Barbados, then you will know the answer. You could also cheat and check with Google. But you can answer the question, and be almost certain of being correct, without cheating and without knowing anything about geography. Indeed, I could have removed the first column giving the country names, and even not told you that the data correspond to land areas, and you should still have been able to tell me which column is real and which is fake.

So, which column is faking it? And how do you know?

I’ll write a follow-up post giving the answer and explanation sometime soon. Meantime, if you’d like to write to me giving your own version, I’d be happy to hear from you.

Midrange is dead

Kirk Goldsberry is the author of a new book on data analytics for the NBA. I haven’t read the book, but some of the graphical illustrations he’s used for its publicity are great examples of the way data visualization techniques can give insights about the evolution of a sport in terms of the way it is played.

[Embedded tweet: animated graphic of NBA shot locations by season]

Press the start button in the graphic of the above tweet. I’m not sure exactly how the graphic and the data are mapped, but essentially the coloured hexagons show regions of the basketball court which are the most frequent locations for taking shots. The animation shows how this pattern has changed over the seasons.

As you probably know, most baskets – excluding free throws – are awarded 2 points. But a shot that’s scored from outside a distance of 7.24m from the basket – the almost semi-circular outer zone shown in the figure – scores 3 points. So, there are two ways to improve the number of points you are likely to score when shooting: first, you can get closer to the basket, so that the shot is easier; or second, you can shoot from outside the three-point line, so increasing the number of points obtained when you do score. That means there’s a zone in between, where the shot is still relatively difficult because of the distance from the basket, but for which you only get 2 points when you do score. And what the animation above clearly shows is an increasing tendency over the seasons for players to avoid shooting from this zone. This is perhaps partly because of a greater understanding of the trade-off between difficulty and distance, and perhaps also because improved training techniques have led to a greater competency in 3-point shots.
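To make the trade-off concrete, here’s a back-of-the-envelope calculation. The shooting percentages are invented purely for illustration – they’re not taken from Goldsberry’s data – but they show why the midrange loses out:

```python
# Expected points per shot = P(shot is made) * points awarded.
# The make-probabilities below are invented for illustration only.
shots = {
    "close to the basket (2 pts)": (0.60, 2),
    "midrange (2 pts)":            (0.40, 2),
    "beyond the arc (3 pts)":      (0.36, 3),
}

for zone, (p_make, points) in shots.items():
    print(f"{zone}: expected points per shot = {p_make * points:.2f}")
```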

Evidence to support this reasoning is the following data heatmap diagram which shows the average number of points scored from shots taken at different locations on the court. The closer to red, the higher the average score per shot.

Again the picture makes things very clear: average points scored are highest when shooting from very close to the basket, or from outside the 3-point line. Elsewhere the average is low. It’s circumstantial evidence, but the fact that this map of points scored is so similar to the current map of where players shoot from strongly suggests that players have evolved their style of play in order to shoot from positions which they know are likely to generate the most points.

In summary, creative use of both static and animated graphical data representations provides great insight into the way basketball play has evolved, and why that evolution is likely to have occurred, given the 3-point shooting rule.


Thanks to Benoit.Jottreau@smartodds.co.uk for posting something along these lines on RocketChat.

Animal experiments

Ever thought your cat might be trolling you? Turns out you’re right. As explained in this New Scientist article, recent Japanese research concludes that cats are entirely capable of recognising their names; they just choose not to when it suits them.

The full details of the experiment are included in a research report published in Nature. It’s an interesting, though not entirely easy, read. But I’d like to use it to point out an aspect of statistical methodology that is often ignored: statistical analyses don’t usually start with the analysis of data; they start with the design of the experiment by which the data are to be collected. And it’s essential that an experiment is designed correctly in order to be able to use Statistics to answer the question you’re interested in.

So, in this particular study, the researchers carried out four separate experiments:

  • In experiment 1, the ability of cats to distinguish their own names from that of other similar nouns was tested;
  • In experiment 2, cats living with numerous other cats were tested to see if they could distinguish their own name from that of other cats in the same household;
  • Experiment 3 was like experiment 1, but using cats from a ‘cat cafe‘ (don’t ask) rather than a normal household;
  • Experiment 4 was also like experiment 1, but using a voice other than the cat’s owner to trigger the responses.

Through this sequence of experiments, the researchers were able to judge whether or not the cats genuinely recognise and respond to their own names in a variety of environments, and to exclude the possibility that the responses were due to factors other than actual name recognition. As such, this is a great example of how the design of an experiment has been carefully tailored to ensure that a statistical analysis of the data it generates is able to answer the question of interest.

I won’t go into details, but there are many other aspects of the experimental design that also required careful specification:

  1. The number of cats to be included in the study (there’s a sketch of this sort of calculation just after this list);
  2. The choice of words to use as alternative stimuli to the cats’ names, and the order in which they are used;
  3. The definitions of actions that are considered positive responses to stimuli;
  4. The protocol for determining whether a cat has responded positively to a stimulus or not;

amongst others. Full details are available in the Nature article, as indeed are the data, should you wish to analyse them yourself.
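Just to give a flavour of the first of those points, here’s the sort of rough sample-size calculation a statistician might sketch out before running such an experiment. It’s a generic two-proportion power calculation with invented response rates – not the calculation the authors actually used:

```python
from math import ceil
from scipy.stats import norm

def subjects_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate number of subjects per group needed to detect a
    difference between response rates p1 and p2 (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# Invented figures: 60% of cats respond to their own name, 30% to a similar noun
print(subjects_per_group(0.6, 0.3))   # roughly 40 cats per group
```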

In the context of sports modelling, these kinds of issues are less explicit, since analyses are usually retrospective, using data that have already been collected and stored. Nonetheless, the selection of which data to include in an analysis can affect its results, and it’s important to ensure that those results are not sensitive to specific, subjective choices. However, for analyses that include a decision process – such as betting strategies – it may well be relevant to formulate an experimental design for a prospective study, comparing the results of one type of strategy with those of another. We’ll discuss strategies for this type of experiment in a future post.

Picture this

You can’t help but be amazed at the recent release of the first ever genuine image of a black hole. The picture itself, and the knowledge of what it represents, are extraordinary enough, but the sheer feat of human endeavour that led to this image is equally breathtaking.

Now, as far as I can see from the list of collaborators that are credited with the image, actual designated statisticians didn’t really contribute. But, from what I’ve read about the process of the image’s creation, Statistics is central to the underlying methodology. I don’t understand the details, but the outline is something like this…

Although black holes are extremely big, they’re also a long way away. This one, for example, has a diameter that’s bigger than our entire solar system. But it’s also at the heart of the Messier 87 galaxy, some 55 million light years away from Earth. Which means that when looking towards it from Earth, it occupies a very small part of the sky. The analogy that’s been given is that capturing the black hole’s image would be equivalent to trying to photograph a piece of fruit on the surface of the moon. And the laws of optics imply this would require a telescope the size of our whole planet.

To get round this limitation, the Event Horizon Telescope (EHT) program uses simultaneous signals collected from a network of eight powerful telescopes stationed around the Earth. However, the result, naturally, is a sparse grid of signals rather than a complete image. The rotation of the earth means that with repeated measurements this grid gets filled out a little. But still, there’s a lot of blank space that needs to be filled in to complete the image. So, how is that done?

In principle, the idea is simple enough. This video was made some years ago by Katie Bouman, who’s now got worldwide fame for leading the EHT program to produce the black hole image:

The point of the video is that to recognise the song, you don’t need the whole keyboard to be functioning. You just need a few of the keys to be working – and they don’t even have to be 100% precise – to be able to identify the whole song. I have to admit that the efficacy of this video was offset for me by the fact that I got the song wrong, but in the YouTube description of the video, Katie explains this is a common mistake, and uses the point to illustrate that with insufficient data you might get the wrong answer. (I got the wrong answer with complete data though!)

In the case of the music video, it’s our brain that fills in the gaps to give us the whole tune. In the case of the black hole data, it’s sophisticated image-reconstruction techniques that rely on the known physics of light transmission and a library of the patterns found in many different types of image. From this combination of physics and image templates, it’s possible to extrapolate from the observed data to build proposal images, and for each one find a score of how plausible that image is. The final image is then the one that has the greatest plausibility score. Engineers call this image reconstruction; but the algorithm is fundamentally statistical.
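Here’s a heavily stripped-down caricature of that idea in one dimension – entirely my own toy example, nothing like the actual EHT algorithms. We observe a signal at only a few sparse points, propose a handful of candidate reconstructions, and score each one by how well it fits the observations and how ‘plausible’ (here, simply how smooth) it is. The reconstruction with the highest score wins.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "true" signal, observed only at a handful of sparse points.
t = np.linspace(0, 1, 100)
truth = np.sin(2 * np.pi * t)
observed_idx = rng.choice(100, size=8, replace=False)
observed = truth[observed_idx]

def plausibility(proposal, smoothness_weight=1.0):
    """Higher is better: reward agreement with the sparse observations,
    penalise roughness (a crude stand-in for an image prior)."""
    fit = -np.sum((proposal[observed_idx] - observed) ** 2)
    roughness = np.sum(np.diff(proposal) ** 2)
    return fit - smoothness_weight * roughness

# A handful of candidate reconstructions (one of them happens to be right).
proposals = {
    "flat":       np.zeros(100),
    "noise":      rng.normal(size=100),
    "sine":       np.sin(2 * np.pi * t),
    "half-speed": np.sin(np.pi * t),
}

best = max(proposals, key=lambda name: plausibility(proposals[name]))
print("most plausible reconstruction:", best)   # should pick "sine"
```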

At least, that’s how I understood things. But here’s Katie again giving a much better explanation in a TED talk:

Ok, so much for black holes. Now, think of:

  1. Telescopes as football matches;
  2. Image data as match results;
  3. The black hole as a picture that contains information about how good football teams really are;
  4. Astrophysics as the rules by which football matches are played;
  5. The templates that describe how an image changes from one pixel to the next as a rule for saying how team performances might change from one game to the next.

And you can maybe see that in a very general sense, the problem of reconstructing an image of a black hole has the same elements as that of estimating the abilities of football teams. Admittedly, our football models are rather less sophisticated, and we don’t need to wait for the end of the Antarctic winter to ship half a tonne of hard drives containing data back to the lab for processing. But the principles of Statistics are generally the same in all applications, from black hole imaging to sports modelling, and everything in between.

Calling BS

You have to be wary of newspaper articles published on 1 April, but I think this one is genuine. The Guardian on Monday contained a report about scientific research into bullshit. Or more specifically, a scientific/statistical study into the demographics of bullshitting.

Now, to make any sense of this, it’s important first to understand what bullshit is. Bullshit is different from lying. The standard treatise in this field is ‘On Bullshit’ by Harry Frankfurt. I’m not kidding. He writes:

It is impossible for someone to lie unless he thinks he knows the truth. Producing bullshit requires no such conviction

In other words, bullshitting is providing a version of events that gives the impression you know what you are talking about, when in fact you don’t.

Unfortunately, standard dictionaries tend to define bullshitting as something like ‘talking nonsense’, though this is – irony alert – bullshit. This article explains why and includes the following example. Consider the phrase

Hidden meaning transforms unparalleled abstract beauty.

It argues that since the sentence is grammatically correct, but intellectually meaningless, it is an example of bullshit. On the other hand, the same set of words in a different order, for example

Unparalleled transforms meaning beauty hidden abstract.

is simply nonsense. Since the words lack grammatical structure, the author isn’t bullshitting. He’s just talking garbage.

So, bullshit is different from lying in that the bullshitter will generally not know the truth; and it’s different from nonsense in that it has specific intent to deceive or misdirect.

But back to the Guardian article. The statistical study it refers to reveals a number of interesting outcomes:

  • Boys bullshit more than girls;
  • Children from higher socioeconomic backgrounds tend to bullshit more than those from poorer backgrounds;
  • North Americans bullshit the most (among the countries studied);
  • Bullshitters tend to perceive themselves as self-confident and high achievers.

If only I could think of an example of a self-confident, North American male from a wealthy background with a strong tendency to disseminate bullshit in order to illustrate these points.

But what’s all this got to do with Statistics? Well, it cuts both ways. First, the cool logic of Statistics can be used to identify and correct bullshit. Indeed, if you happen to study at the University of Washington, you can enrol for the course ‘Calling Bullshit: Data Reasoning in a Digital World‘, which is dedicated to the subject. The objectives for this course, as listed in its syllabus, are that after the course you should be able to:

  • Remain vigilant for bullshit contaminating your information diet.
  • Recognize said bullshit whenever and wherever you encounter it.
  • Figure out for yourself precisely why a particular bit of bullshit is bullshit.
  • Provide a statistician or fellow scientist with a technical explanation of why a claim is bullshit.
  • Provide your crystals-and-homeopathy aunt or casually racist uncle with an accessible and persuasive explanation of why a claim is bullshit.

I especially like the fact that after following this course you’ll be well-equipped to take on both the renegade hippy and racist wings of your family.

So that’s the good side of things. On the bad side, it’s extremely easy to use Statistics to disseminate bullshit. Partly because not everyone is sufficiently clued-up to really understand statistical concepts and to be critical when confronted with them; and partly because, even if you have good statistical knowledge and are appropriately sceptical, you’re still likely to have to rely on the accuracy of an analysis without access to the data on which it was based.

For example, this article, which is an interesting read on the subject of Statistics and bullshit, discusses a widely circulated fact, attributed to the Crime Statistics Bureau of San Francisco, that:

81% of white homicide victims were killed by blacks

Except, it turns out, that the Crime Statistics Bureau of San Francisco doesn’t exist, and FBI figures actually suggest that 80% of white murder victims were killed by other white people. So, it’s a bullshit statement attributed to a bullshit organisation. But with social media, the dissemination of these untruths goes viral, and it becomes almost impossible to correct them with actual facts. Indeed, the above statement was included in an image posted to Twitter by Donald Trump during his election campaign: full story here. And that tweet alone got almost 7,000 retweets. So although, using reliable statistics, the claim is easily disproved, the message has already spread and the damage is done.

So, welcome to Statistics: helping, and helping fight, bullshit.

Britain’s Favourite Crisps

As I’ve mentioned before, my aim in this blog is to raise awareness and understanding of statistical concepts and procedures, particularly with regard to potential applications in sports modelling. Often this will involve discussing particular techniques and methodologies. But sometimes it might involve simply referencing the way Statistics has been used to address some important topic of the day.

With this latter point in mind, Channel 5 recently showed a programme titled ‘Britain’s Favourite Crisps’ in which they revealed the results of a survey investigating, well, Britain’s favourite crisps. Now, if your cultural roots are not based in the UK, the complexities of crisp preference might seem as strange as the current wrangling over Brexit. But those of you who grew up in the UK are likely to be aware of the sensitivities of this issue. Crisp preferences, that is. Let’s not get started on Brexit.

A summary of the results of the survey is contained in the following diagram:

And a complete ranking of the top 20 is included here.

As you might expect for such a contentious issue, the programme generated a lot of controversy. For example:

And so on.

Personally, I’m mildly upset – I won’t say outraged exactly – at Monster Munch appearing only in the Mid-Tier. But let me try to put my own biases aside and turn to some statistical issues. These results are based on some kind of statistical survey, but this raises a number of questions. For example:

  1. How many people were included in the survey?
  2. How were they interviewed? Telephone? Internet? Person-to-person?
  3. How were they selected? Completely randomly? Or balanced to reflect certain demographics? Something else?
  4. What were they asked? Just their favourite? Or a ranking of their top 20 say?
  5. Were participants given a list of crisps to choose from, or were they given complete freedom of choice?
  6. Is it fair to consider Walkers or Pringles as single categories, when they cover many different flavours, while other crisps, such as Quavers, have just a single variety?
  7. How were results calculated? Just straight averages based on sample results, or weighted to correct demographic imbalances in the survey sample? (There’s a small illustration of such weighting just after this list.)
  8. How was the issue of non-respondents handled?
  9. How certain can we be that the presented results are representative of the wider population?
  10. Is a triangle appropriate for representing the results? It suggests the items in each row are equivalent. Was that intended? If so, is it justified by the results?
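To illustrate what the weighting in question 7 involves, here’s a minimal sketch. All the numbers are invented: suppose younger respondents were over-represented in the sample and also more likely to name Walkers as their favourite.

```python
# For each (invented) demographic group: the share naming Walkers as their
# favourite, the group's share of the survey sample, and its share of the
# actual population.
groups = {
    #            favourite, sample share, population share
    "under 35": (0.50,      0.60,         0.40),
    "35+":      (0.30,      0.40,         0.60),
}

unweighted = sum(fav * sample for fav, sample, _ in groups.values())
weighted = sum(fav * population for fav, _, population in groups.values())

print(f"unweighted estimate: {unweighted:.0%}")   # 42%
print(f"weighted estimate:   {weighted:.0%}")     # 38%
```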

It may be that some of these questions are answered in the programme itself. Unfortunately, living outside the UK, I can’t access the programme, but those of you based in the UK can, at least for some time, here. So, if you are able to watch it and get answers to any of the questions, please post them in the comments section. But my guess is that most of the questions will remain unanswered.

So, what’s the point? Well, any statistical study requires careful design and analysis. Decisions have to be made about the design and execution of an experiment, and these are likely to influence the eventual results. Consequently, the analysis itself should also take into account the way the experiment was designed, and attempt to correct for potential imbalances. Moreover, a proper understanding of the results requires detailed knowledge of all aspects of the study, from its design through to the final analysis.

And the message is, never take results of a statistical analysis on trust. Ask questions. Query the design. Ask where the data came from. Check the methodology. Challenge the results. Ask about accuracy. Question whether the results have been presented fairly.

Moreover, remember that Statistics is as much an art as a science. Both the choice of design of an experiment and the randomness in data mean that a different person carrying out the same analysis is likely to get different results.

And all of this is as true for sports modelling as it is for the ranking of Britain’s favourite crisps.

The Datasaurus Dataset

Look at the data in this table. There are two rows of data labelled g1 and g2. I won’t, for the moment, tell you where the data come from, except that the data are in pairs. So, each column of the table represents a pair of observations: (2, 1) is the first pair, (3, 5) is the second pair and so on. Just looking at the data, what would you conclude?

Scroll down once you’ve thought about this question.

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

Maybe you’re better at this stuff than me, but I wouldn’t find this an easy question to answer. Even though there are just 10 observations, and each observation contains just a pair of values, I find it difficult to simply look at the numbers and see any kind of pattern at all, either in the individual rows of numbers, or in any possible relationship between the two. And if it’s difficult in this situation, it’s bound to be much more difficult when there might be many thousands or millions of observations, and each observation might not be just a pair, but several – perhaps many – numbers.

So, not easy. But it’s a standard statistical requirement: taking a set of observations – in this case pairs – and trying to understand what they might convey about the process they come from. It’s really the beating heart of Statistics: trying to understand structure from data. Yet even with just 10 pairs of observations, the task isn’t straightforward.

To deal with this problem, an important aspect of statistical analysis is the summarisation of data – reducing the information they contain to just a few salient features. Specifically, in this case, reducing the information that’s contained in the 10 pairs of observations to a smaller number of numbers – so-called statistics – that summarise the most relevant aspects of the information that the data contain. The most commonly used statistics, as you probably know, are:

  1. The means: the average values of each of the g1 and g2 sets of values.
  2. The standard deviations: measures of spread around the means of each of the g1 and g2 sets of values.
  3. The correlation: a measure, on a scale of -1 to 1, of the tendency for the g1 and g2 values to be related to each other.

The mean is well-known. The standard deviation is a measure of how spread out a set of values are: the more dispersed the numbers, the greater the standard deviation. Correlation is maybe less well understood, but provides a measure of the extent to which 2 sets of variables are linked to one another (albeit in a linear sense).
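For what it’s worth, here’s how those three summaries would be computed in Python. Since the full table isn’t reproduced above, the arrays below contain only the two pairs quoted at the start of the post followed by made-up placeholder values – substitute the real ten pairs to reproduce the figures given below.

```python
import numpy as np

# The first two pairs, (2, 1) and (3, 5), are quoted in the text; the remaining
# values are invented placeholders – replace them with the ten pairs from the table.
g1 = np.array([2, 3, 1, 0, 2, 1, 3, 0, 1, 2])
g2 = np.array([1, 5, 0, 0, 2, 1, 1, 0, 3, 2])

print("means:", g1.mean(), g2.mean())
print("standard deviations:", g1.std(ddof=1), g2.std(ddof=1))
print("correlation:", np.corrcoef(g1, g2)[0, 1])
```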

So, rather than trying to identify patterns in a set of 10 pairs of numbers, we reduce the data to their main features:

  • g1 mean = 2.4; g2 mean = 1.8
  • g1 standard deviation = 0.97; g2 standard deviation = 1.48
  • (g1,g2) correlation = 0.22

And from this we can start to build a picture of what the data tell us:

  1. The average value of g1 is rather greater – actually 0.6 greater – than the mean of g2, so there is a tendency for the g1 component of a pair to be bigger than the g2 component.
  2. The g2 values are more spread out than the g1 values.
  3. The positive value of correlation, albeit a value substantially lower than the maximum of 1, suggests that there is a tendency for the g1 and g2 components to be associated: bigger values of g1 tend to imply bigger values of g2.

So now let me tell you what the data are: they are the home and away scores, g1 and g2 respectively, in the latest round of games – matchday 28 – in Serie A. So, actually, the summary values make quite good sense: the mean of g1 is greater than the mean of g2, which is consistent with a home advantage effect. And it’s generally accepted that home and away scores tend to be positively correlated. It’s maybe a little surprising that the standard deviation of away goals is greater than that of home goals, but with just 10 games this is very likely just to be a chance occurrence.

Which gives rise to a different issue: we’re unlikely to be interested in the patterns contained in the data from these particular 10 games. It’s much more likely we’re interested in what they might tell us about the pattern of results in a wider set of games – perhaps Serie A games from any arbitrary matchday.

But that’s a story for another post sometime. The point of this post is that we’re simply not programmed to look at large (or even quite small) datasets and be able to see any patterns or messages they might contain. Rather, we have to summarise data with just a few meaningful statistics in order to understand and compare them.


But actually, all of the above is just a precursor to what I actually wanted to say in this post. Luigi.Colombo@smartodds.co.uk recently forwarded the following Twitter post to the quant team on RocketChat. Press the start arrow to set off the animation.

As explained in the message, every single one of the images in this animation – including the passages from one of the main images to another – has exactly the same summary statistics. That’s to say, the mean and standard deviation of both the x- and y-values stay the same, as does the correlation between the two sets of values.

So what’s the moral here? Well, as we saw above, reduction of data to simple summary statistics is immensely helpful in getting a basic picture of the structure of data. But: it is a reduction nonetheless, and something is lost. All of the datasets in the Twitter animation have identical summary statistics, yet the data themselves are dramatically different from one image to another.

So, yes, follow my advice above and use summary statistics to understand data better. But be aware that a summary of data is just that, a summary, and infinitely many other datasets will have exactly the same summary statistics. If it’s important to you that your data look more like concentric ellipses than a dinosaur, you’d better not rely on means and standard deviations to tell you so.