Picture this

You can’t help but be amazed at the recent release of the first ever genuine image of a black hole. The picture itself, and the knowledge of what it represents, are extraordinary enough, but the sheer feat of human endeavour that led to this image is equally breathtaking.

Now, as far as I can see from the list of collaborators that are credited with the image, actual designated statisticians didn’t really contribute. But, from what I’ve read about the process of the image’s creation, Statistics is central to the underlying methodology. I don’t understand the details, but the outline is something like this…

Although black holes are extremely big, they’re also a long way away. This one, for example, has a diameter that’s bigger than our entire solar system. But it’s also at the heart of the Messier 87 galaxy, some 55 million light years away from Earth. Which means that when looking towards it from Earth, it occupies a very small part of space. The analogy that’s been given is that capturing the black hole’s image in space would be equivalent to trying to photograph a piece of fruit on the surface of the moon. And the laws of optics imply this would require a telescope the size of our whole planet.

To get round this limitation, the Event Horizon Telescope (EHT) program uses simultaneous signals collected from a network of eight powerful telescopes stationed around the Earth. However, the result, naturally, is a sparse grid of signals rather than a complete image. The rotation of the Earth means that with repeat measurements this grid gets filled out a little. But still, there’s a lot of blank space that needs to be filled in to complete the image. So, how is that done?

In principle, the idea is simple enough. This video was made some years ago by Katie Bouman, who’s now got worldwide fame for her part in the EHT program’s work to produce the black hole image:

The point of the video is that to recognise the song, you don’t need the whole keyboard to be functioning. You just need a few of the keys to be working – and they don’t even have to be 100% precise – to be able to identify the whole song. I have to admit that the efficacy of this video was offset for me by the fact that I got the song wrong, but in the YouTube description of the video, Katie explains this is a common mistake, and uses the point to illustrate that with insufficient data you might get the wrong answer. (I got the wrong answer with complete data though!)

In the case of the music video, it’s our brain that fills in the gaps to give us the whole tune. In the case of the black hole data, it’s sophisticated imaging techniques that rely on the known physics of light transmission and a library of the patterns found in images of many different types. From this combination of physics and image templates, it’s possible to extrapolate from the observed data to build proposal images, and for each one find a score of how plausible that image is. The final image is then the one that has the greatest plausibility score. Engineers call this image reconstruction, but the algorithm is fundamentally statistical.
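To make the "score each proposal, keep the most plausible" idea concrete, here is a toy sketch. It is emphatically not the EHT algorithm: the one-dimensional "image", the three candidate templates, the observed pixels and the smoothness weight are all hypothetical values invented for illustration. The score simply rewards agreement with the few observed pixels and penalises jumpy pixel-to-pixel changes.

```python
def plausibility(candidate, observed, smoothness_weight=0.1):
    """Score a candidate image: higher means more plausible.

    candidate: list of pixel values (a 1-D stand-in for an image).
    observed:  dict mapping pixel index -> measured value (sparse data).
    """
    # Penalty for disagreeing with the pixels we actually measured.
    misfit = sum((candidate[i] - value) ** 2 for i, value in observed.items())
    # Penalty for large jumps between neighbouring pixels (a crude "template").
    roughness = sum((b - a) ** 2 for a, b in zip(candidate, candidate[1:]))
    return -(misfit + smoothness_weight * roughness)


def most_plausible(candidates, observed):
    """Return the name of the candidate with the greatest plausibility score."""
    return max(candidates, key=lambda name: plausibility(candidates[name], observed))


# Hypothetical sparse measurements: only 3 of the 7 pixels were observed.
observed = {0: 0.0, 3: 1.0, 6: 0.1}

# Hypothetical proposal images.
candidates = {
    "flat": [0.4] * 7,
    "bump": [0.0, 0.3, 0.7, 1.0, 0.7, 0.3, 0.0],
    "spiky": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.1],
}

best = most_plausible(candidates, observed)
print(best)
```

Note that the "spiky" candidate matches the observed pixels perfectly but still loses, because the score also asks whether the unobserved pixels look like a plausible image. That trade-off between fitting the data and respecting prior expectations is what makes the problem statistical.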

At least, that’s how I understood things. But here’s Katie again giving a much better explanation in a TED talk:

Ok, so much for black holes. Now, think of:

  1. Telescopes as football matches;
  2. Image data as match results;
  3. The black hole as a picture that contains information about how good football teams really are;
  4. Astrophysics as the rules by which football matches are played;
  5. The templates that describe how an image changes from one pixel to the next as a rule for saying how team performances might change from one game to the next.

And you can maybe see that in a very general sense, the problem of reconstructing an image of a black hole has the same elements as that of estimating the abilities of football teams. Admittedly, our football models are rather less sophisticated, and we don’t need to wait for the end of the Antarctic winter to ship half a tonne of hard drives containing data back to the lab for processing. But the principles of Statistics are generally the same in all applications, from black hole imaging to sports modelling, and everything in between.

Calling BS

You have to be wary of newspaper articles published on 1 April, but I think this one is genuine. The Guardian on Monday contained a report about scientific research into bullshit. Or more specifically, a scientific/statistical study into the demographics of bullshitting.

Now, to make any sense of this, it’s important first to understand what bullshit is. Bullshit is different from lying. The standard treatise in this field is ‘On Bullshit’ by Harry Frankfurt. I’m not kidding. He writes:

It is impossible for someone to lie unless he thinks he knows the truth. Producing bullshit requires no such conviction

In other words, bullshitting is providing a version of events that gives the impression you know what you are talking about, when in fact you don’t.

Unfortunately, standard dictionaries tend to define bullshitting as something like ‘talking nonsense’, though this is – irony alert – bullshit. This article explains why and includes the following example. Consider the phrase

Hidden meaning transforms unparalleled abstract beauty.

It argues that since the sentence is grammatically correct, but intellectually meaningless, it is an example of bullshit. On the other hand, the same set of words in a different order, for example

Unparalleled transforms meaning beauty hidden abstract.

is simply nonsense. Since it lacks grammatical structure, the author isn’t bullshitting. He’s just talking garbage.

So, bullshit is different from lying in that the bullshitter will generally not know the truth; and it’s different from nonsense in that it has specific intent to deceive or misdirect.

But back to the Guardian article. The statistical study it refers to reveals a number of interesting outcomes:

  • Boys bullshit more than girls;
  • Children from higher socioeconomic backgrounds tend to bullshit more than those from poorer backgrounds;
  • North Americans bullshit the most (among the countries studied);
  • Bullshitters tend to perceive themselves as self-confident and high achievers.

If only I could think of an example of a self-confident, North American male from a wealthy background with a strong tendency to disseminate bullshit in order to illustrate these points.

But what’s all this got to do with Statistics? Well, it cuts both ways. First, the cool logic of Statistics can be used to identify and correct bullshit. Indeed, if you happen to study at the University of Washington, you can enrol for the course ‘Calling Bullshit: Data Reasoning in a Digital World’, which is dedicated to the subject. The objectives for this course, as listed in its syllabus, are that after the course you should be able to:

  • Remain vigilant for bullshit contaminating your information diet.
  • Recognize said bullshit whenever and wherever you encounter it.
  • Figure out for yourself precisely why a particular bit of bullshit is bullshit.
  • Provide a statistician or fellow scientist with a technical explanation of why a claim is bullshit.
  • Provide your crystals-and-homeopathy aunt or casually racist uncle with an accessible and persuasive explanation of why a claim is bullshit.

I especially like the fact that after following this course you’ll be well-equipped to take on both the renegade hippy and racist wings of your family.

So that’s the good side of things. On the bad side, it’s extremely easy to use Statistics to disseminate bullshit. Partly because not everyone is sufficiently clued-up to really understand statistical concepts and to be critical when confronted with them; and partly because, even if you have a good statistical knowledge and are appropriately sceptical, you’re still likely to have to rely on the accuracy of an analysis without access to the data on which it was based.

For example, this article, which is an interesting read on the subject of Statistics and bullshit, discusses a widely circulated fact, attributed to the Crime Statistics Bureau of San Francisco, that:

81% of white homicide victims were killed by blacks

Except, it turns out, that the Crime Statistics Bureau of San Francisco doesn’t exist and FBI figures actually suggest that 80% of white murder victims were killed by other white people. So, it’s a bullshit statement attributed to a bullshit organisation. But with social media, the dissemination of these untruths becomes viral, and it becomes impossible to correct them with actual facts. Indeed, the above statement was included in an image posted to Twitter by Donald Trump during his election campaign: full story here. And that tweet alone got almost 7000 retweets. So though the claim is easily disproved using reliable statistics, the message has already spread and the damage is done.

So, welcome to Statistics: helping, and helping fight, bullshit.




Britain’s Favourite Crisps


As I’ve mentioned before, my aim in this blog is to raise awareness and understanding of statistical concepts and procedures, particularly with regard to potential applications in sports modelling. Often this will involve discussing particular techniques and methodologies. But sometimes it might involve simply referencing the way statistics has been used to address some particular important topic of the day.

With this latter point in mind, Channel 5 recently showed a programme titled ‘Britain’s Favourite Crisps’ in which they revealed the results of a survey investigating, well, Britain’s favourite crisps. Now, if your cultural roots are not based in the UK, the complexities of crisp preference might seem as strange as the current wrangling over Brexit. But those of you who grew up in the UK are likely to be aware of the sensitivities of this issue. Crisp preferences, that is. Let’s not get started on Brexit.

A summary of the results of the survey is contained in the following diagram:

And a complete ranking of the top 20 is included here.

As you might expect for such a contentious issue, the programme generated a lot of controversy. For example:

And so on.

Personally, I’m mildly upset – I won’t say outraged exactly – at Monster Munch appearing only in the Mid-Tier. But let me try to put my own biases aside and turn to some statistical issues. These results are based on some kind of statistical survey, but this raises a number of questions. For example:

  1. How many people were included in the survey?
  2. How were they interviewed? Telephone? Internet? Person-to-person?
  3. How were they selected? Completely randomly? Or balanced to reflect certain demographics? Something else?
  4. What were they asked? Just their favourite? Or a ranking of their top 20 say?
  5. Were participants given a list of crisps to choose from, or were they given complete freedom of choice?
  6. Is it fair to consider Walkers or Pringles as single categories, when they cover many different flavours, while other crisps, such as Quavers, have just a single variety?
  7. How were results calculated? Just straight averages based on sample results, or weighted to correct demographic imbalances in the survey sample?
  8. How was the issue of non-respondents handled?
  9. How certain can we be that the presented results are representative of the wider population?
  10. Is a triangle appropriate for representing the results? It suggests the items in each row are equivalent. Was that intended? If so, is it justified by the results?
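Question 7 is easier to appreciate with numbers. The sketch below contrasts a straight sample proportion with a weighted one that corrects for a demographic imbalance (a simple form of post-stratification). Everything in it is hypothetical: the two age groups, their population shares, the crisp names and the responses are invented for illustration and have nothing to do with the actual survey.

```python
from collections import Counter


def raw_share(responses, choice):
    """Proportion of all respondents who picked `choice`."""
    return sum(1 for _, favourite in responses if favourite == choice) / len(responses)


def weighted_share(responses, population_share, choice):
    """Share of `choice` after reweighting each group to its population share.

    responses:        list of (group, favourite) pairs.
    population_share: dict mapping group -> share of the wider population.
    Groups with no respondents are skipped, so this assumes every
    population group is represented in the sample.
    """
    group_sizes = Counter(group for group, _ in responses)
    total = 0.0
    for group, share in population_share.items():
        if group_sizes[group] == 0:
            continue
        picks = sum(1 for g, fav in responses if g == group and fav == choice)
        total += share * picks / group_sizes[group]
    return total


# Hypothetical sample: 8 younger respondents but only 2 older ones.
responses = (
    [("under_40", "crisp A")] * 6
    + [("under_40", "crisp B")] * 2
    + [("over_40", "crisp B")] * 2
)

# Hypothetical population: the two groups are actually equal in size.
population_share = {"under_40": 0.5, "over_40": 0.5}

raw = raw_share(responses, "crisp A")
weighted = weighted_share(responses, population_share, "crisp A")
print(raw, weighted)
```

In this made-up sample, crisp A looks like the favourite of a clear majority on the raw figures, but falls well below half once the over-represented younger group is weighted down. Without knowing which calculation a survey used, the headline ranking is hard to interpret.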

It may be that some of these questions are answered in the programme itself. Unfortunately, living outside the UK, I can’t access the programme, but those of you based in the UK can, at least for some time, here. So, if you are able to watch it and get answers to any of the questions, please post them in the comments section. But my guess is that most of the questions will remain unanswered.

So, what’s the point? Well, statistical studies of any type require careful design and analysis. Decisions have to be made about the design and execution of an experiment, and these are likely to influence the eventual results. Consequently, the analysis itself should also take into account the way the experiment was designed, and attempt to correct for potential imbalances. Moreover, a proper understanding of the results of a statistical analysis requires detailed knowledge of every stage of the study, from design to analysis.

And the message is, never take results of a statistical analysis on trust. Ask questions. Query the design. Ask where the data came from. Check the methodology. Challenge the results. Ask about accuracy. Question whether the results have been presented fairly.

Moreover, remember that Statistics is as much an art as a science. Both the choice of design of an experiment and the randomness in data mean that a different person carrying out the same analysis is likely to get different results.

And all of this is as true for sports modelling as it is for the ranking of Britain’s favourite crisps.

The Datasaurus Dataset

Look at the data in this table. There are 2 rows of data labelled g1 and g2. I won’t, for the moment, tell you where the data come from, except that the data are in pairs. So, each column of the table represents a pair of observations: (2, 1) is the first pair, (3, 5) is the second pair and so on. Just looking at the data, what would you conclude?

Scroll down once you’ve thought about this question.


Maybe you’re better at this stuff than me, but I wouldn’t find this an easy question to answer. Even though there are just 10 observations, and each observation contains just a pair of values, I find it difficult to simply look at the numbers and see any kind of pattern at all, either in the individual rows of numbers, or in any possible relationship between the two. And if it’s difficult in this situation, it’s bound to be much more difficult when there might be many thousands or millions of observations, and each observation might not be just a pair, but several – perhaps many – numbers.

So, not easy. But it’s a standard statistical requirement: taking a set of observations – in this case pairs – and trying to understand what they might convey about the process they come from. It’s really the beating heart of Statistics: trying to understand structure from data. Yet even with just 10 pairs of observations, the task isn’t straightforward.

To deal with this problem, an important aspect of statistical analysis is the summarisation of data – reducing the information they contain to just a few salient features. Specifically, in this case, reducing the information that’s contained in the 10 pairs of observations to a smaller set of numbers – so-called statistics – that summarise the most relevant aspects of the information that the data contain. The most commonly-used statistics, as you probably know, are:

  1. The means: the average values of each of the g1 and g2 sets of values.
  2. The standard deviations: measures of spread around the means of each of the g1 and g2 sets of values.
  3. The correlation: a measure, on a scale of -1 to 1, of the tendency for the g1 and g2 values to be related to each other.

The mean is well-known. The standard deviation is a measure of how spread out a set of values is: the more dispersed the numbers, the greater the standard deviation. Correlation is maybe less well understood, but provides a measure of the extent to which two variables are linked to one another (albeit in a linear sense).

So, rather than trying to identify patterns in a set of 10 pairs of numbers, we reduce the data to their main features:

  • g1 mean = 2.4; g2 mean = 1.8
  • g1 standard deviation = 0.97; g2 standard deviation = 1.48
  • (g1,g2) correlation = 0.22
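For anyone who wants to see exactly what these three summaries involve, here is a minimal sketch. The five pairs of values are hypothetical stand-ins, not the data from the table, and I have assumed the sample (n − 1) version of the standard deviation; the figures quoted above may have used a different convention.

```python
from math import sqrt


def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)


def std_dev(values):
    """Sample standard deviation (divides by n - 1; an assumed convention)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return covariance / (std_dev(xs) * std_dev(ys))


# Hypothetical paired observations, for illustration only.
g1 = [2, 3, 1, 0, 4]
g2 = [1, 2, 0, 1, 1]

print("means:", mean(g1), mean(g2))
print("standard deviations:", round(std_dev(g1), 2), round(std_dev(g2), 2))
print("correlation:", round(correlation(g1, g2), 2))
```

Five numbers now stand in for the ten values that went in, and the saving grows rapidly with the size of the dataset.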

And from this we can start to build a picture of what the data tell us:

  1. The average value of g1 is rather greater – actually 0.6 greater – than the mean of g2, so there is a tendency for the g1 component of a pair to be bigger than the g2 component.
  2. The g2 values are more spread out than the g1 values.
  3. The positive value of correlation, albeit a value substantially lower than the maximum of 1, suggests that there is a tendency for the g1 and g2 components to be associated: bigger values of g1 tend to imply bigger values of g2.

So now let me tell you what the data are: they are the home and away scores, g1 and g2 respectively, in the latest round of games – matchday 28 – in Serie A. So, actually, the summary values make quite good sense: the mean of g1 is greater than the mean of g2, which is consistent with a home advantage effect. And it’s generally accepted that home and away scores tend to be positively correlated. It’s maybe a little surprising that the standard deviation of away goals is greater than that of home goals, but with just 10 games this is very likely just to be a chance occurrence.

Which gives rise to a different issue: we’re unlikely to be interested in the patterns contained in the data from these particular 10 games. It’s much more likely we’re interested in what they might tell us about the pattern of results in a wider set of games – perhaps Serie A games from any arbitrary matchday.

But that’s a story for another post sometime. The point of this post is that we’re simply not programmed to look at large (or even quite small) datasets and be able to see any patterns or messages they might contain. Rather, we have to summarise data with just a few meaningful statistics in order to understand and compare them.

But actually, all of the above is just a precursor to what I really wanted to say in this post. Luigi.Colombo@smartodds.co.uk recently forwarded the following Twitter post to the quant team on RocketChat. Press the start arrow to set off the animation.

As explained in the message, every single one of the images in this animation – including the passages from one of the main images to another – has exactly the same summary statistics. That’s to say, the mean and standard deviation of both the x- and y-values stay the same, as does the correlation between the two sets of values.

So what’s the moral here? Well, as we saw above, reduction of data to simple summary statistics is immensely helpful in getting a basic picture of the structure of data. But: it is a reduction nonetheless, and something is lost. All of the datasets in the Twitter animation have identical summary statistics, yet the data themselves are dramatically different from one image to another.

So, yes, follow my advice above and use summary statistics to understand data better. But be aware that a summary of data is just that, a summary, and infinitely many other datasets will have exactly the same summary statistics. If it’s important to you that your data look more like concentric ellipses than a dinosaur, you’d better not rely on means and standard deviations to tell you so.
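You don’t need a dinosaur to check this for yourself. As a much simpler illustration than the animation (and not the method used to create it), take any set of points and rotate it by half a turn about its centre: each point (x, y) moves to (2·mean_x − x, 2·mean_y − y). The means, standard deviations and correlation are all unchanged, yet the individual points are different. The data below are hypothetical values chosen to be lopsided so that the change is obvious.

```python
from math import sqrt


def summary(xs, ys):
    """Return (mean_x, mean_y, sd_x, sd_y, correlation) for paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    return (mx, my, sx, sy, r)


def rotate_half_turn(xs, ys):
    """Rotate every point 180 degrees about the centre (mean_x, mean_y)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return [2 * mx - x for x in xs], [2 * my - y for y in ys]


# Hypothetical data: four points clustered low, one far out to the upper right.
xs = [0, 1, 2, 3, 10]
ys = [0, 1, 0, 1, 8]

rx, ry = rotate_half_turn(xs, ys)

original = summary(xs, ys)
rotated = summary(rx, ry)

print([round(v, 3) for v in original])
print([round(v, 3) for v in rotated])
print(sorted(rx))
```

The two summaries agree (up to floating-point rounding), but in the rotated data the outlying point sits far out to the lower left rather than the upper right. A plot would show the difference immediately; the five summary numbers never will.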

The numbers game

If you’re reading this post, you’re likely to be aware already of the importance of Statistics and data for various aspects of sport in general and football in particular. Nonetheless, I recently came across this short film, produced by FourFourTwo magazine, which gives a nice history of the evolution of data analytics in football. If you need a refresher on the topic, this isn’t a bad place to look.

And just in case you don’t think that’s sufficient to justify this post in a Statistics blog, FourFourTwo claims to be ‘the world’s biggest football magazine’. Moreover, many of the articles on the magazine’s website are analytics-orientated. For example: ‘Ronaldo averaged a game every 4.3 days’. Admittedly, many of these articles are barely-disguised advertisements for a wearable GPS device intended for tracking activity of players during matches. But I suppose even 199 (pounds) is a number, right?


The benefit of foresight

Ok, I’m going to be honest… I’m not really happy with this post. I keep deleting it and re-writing it, but can’t get it in a form where it eloquently says what I want it to say. (Insert your own <like all of your other posts> joke here).

I’m trying to say the following things:

  1. Trading in sports – or any field – is about predicting what will happen in the future;
  2. Data are a summary of the past. If the future behaves like the past, then the data are likely to be useful; if it doesn’t, they’re likely to be less useful;
  3. There is often information about the way things are likely to change in the future that’s external to, and not included in, data;
  4. This means that predictions for sports trading based on statistical procedures will always be improved by the inclusion of additional knowledge and information that is provided by experts.

That’s what the rest of this post is trying to say. Unfortunately, it’s an admission of a poor post that I’m having to tell you this in advance, rather than letting you draw these conclusions yourself.


It’s often said that ‘with the benefit of hindsight, things could have been done better’. But since hindsight isn’t available when trading on sports, the best we can do is make optimal use of foresight.

This season has been a record-breaker for the NFL. Among other tumbling records, at 1371, the number of touchdowns in the regular season is the largest in the league’s 99-year history.

Of course, random variation means records will be broken from time to time just by chance, but if this sudden increase in points was actually predictable, then bets placed on the NFL would have been improved if they had taken this into account.
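How often should a record fall by chance alone? Under a deliberately crude assumption, that season totals are independent draws from one unchanging distribution with no ties, the chance that the n-th season sets a new high is exactly 1/n, and the expected number of record-setting seasons in n years is 1 + 1/2 + … + 1/n. That assumption is mine, not the league’s, and it clearly ignores changes in the number of teams and games over time; the sketch below is only a baseline for what "just by chance" would look like.

```python
from fractions import Fraction


def prob_latest_is_record(n_seasons):
    """Chance the most recent of n exchangeable seasons is the all-time high."""
    return Fraction(1, n_seasons)


def expected_records(n_seasons):
    """Expected number of record-setting seasons among n (a harmonic sum)."""
    return sum(Fraction(1, k) for k in range(1, n_seasons + 1))


n = 99  # length of the league's history, as quoted above
p_record = prob_latest_is_record(n)
n_records = expected_records(n)

print(float(p_record))
print(float(n_records))
```

On this baseline a record in any given late season is a roughly one-in-a-hundred event, so a record on its own is weak evidence of a real change. It is the external knowledge about why scoring rose that turns it into something predictable.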

Naturally, as statisticians, our primary source of evidence is contained in data, and we aim to exploit basic patterns and trends in data to help make predictions for the future. But data are by definition a snapshot of the past, and the models we develop will only work well if the future behaves like the past. Admittedly, if changes have already occurred, these will be encapsulated in data, and can be extrapolated into predictive models for the future. But data do not, in themselves, describe mechanisms of change. And it will always be essential to use additional sources of information and knowledge, not contained in data, to temper, inform and modify predictions from data-based statistical models.

With all that in mind, I found this article an interesting read. It provides a chronology of events connected to the NFL, all of which have contributed one way or another to the current attack-based tendency of play. The foresight to use this knowledge at the start of the season, to modify predictions to account for a likely increase in points due to a greater emphasis on attack, would almost certainly have led to better predictions than those provided by using data-based models only.




Statistics on match day

In an earlier post I discussed how the use of detailed in-play statistics was becoming much more important for sports modelling, and we looked at a video made at OPTA where they discuss how the data from a single event in a match is converted into a database entry. In that video there was reference to another video showing OPTA ‘behind-the-scenes’ on a typical match day. You can now see that video below.

Again, this video is a little old now, and chances are that OPTA now use fully genuine copies of Windows (see video at 2:07), but I thought again it might be of interest to see the process by which some of our data are collected. In future posts we might discuss the nature of some of the data that they are collecting.

Anatomy of a goal

One way of trying to improve sports models is to adapt them to include extra information. In football, for example, rather than just using goals from past fixtures, you might try to include more detailed information about how those fixtures played out.

It’s a little old now – 2013 – but I recently came across the video below. As you probably know, OPTA is the leading provider of in-match sports data, giving detailed quantitative measures of every event and every player in a match, not just in football, but for many other sports as well.

In this video, Sam from OPTA is discussing the data derived from a single event in a football match: Iniesta’s winner in the 2010 World Cup final. I think it’s interesting because we tend to treat the data as our raw ingredients, but there is a process by which the action in a game is converted into data, and this video gives insights into that actual process.

In future posts we might look at how some of the data collected this way is used in models.

Incidentally, this video was produced by Numberphile, a group of nerds (sorry, maths enthusiasts) who make fun (well, you know, “fun”) YouTube videos on all aspects of maths and numbers, including, occasionally, statistics. Chances are I’ll be digging through their archives to see if there’s anything else I can steal (sorry, borrow) for the blog.

Question: if you watch the video carefully, you will see at some point (2:12, precisely) that event type number 31 is “Picked an orange”. What is that about? Is “picked an orange” a colloquialism for something? Forgive my ignorance, but I have simply no idea, and would be really happy if someone could explain.

Update: Here are 2 plausible explanations from Ian.Rutherford@smartbapps.co.uk:

  1. The keeper catches a cross
  2. Yellow card given, but could’ve been red

If anyone knows the answer or has alternative suggestions I’ll include them here, thanks.

Actually, could it be this? When a match is played in snowy conditions, an orange ball is used to make it more visible. Maybe “picking an orange” refers to the decision to switch to such a ball by the referee.