Picture this

You can’t help but be amazed at the recent release of the first ever genuine image of a black hole. The picture itself, and the knowledge of what it represents, are extraordinary enough, but the sheer feat of human endeavour that led to this image is equally breathtaking.

Now, as far as I can see from the list of collaborators that are credited with the image, actual designated statisticians didn’t really contribute. But, from what I’ve read about the process of the image’s creation, Statistics is central to the underlying methodology. I don’t understand the details, but the outline is something like this…

Although black holes are extremely big, they’re also a long way away. This one, for example, has a diameter that’s bigger than our entire solar system. But it’s also at the heart of the Messier 87 galaxy, some 55 million light years away from Earth. Which means that when looking towards it from Earth, it occupies a very small part of space. The analogy that’s been given is that capturing the black hole’s image in space would be equivalent to trying to photograph a piece of fruit on the surface of the moon. And the laws of optics imply this would require a telescope the size of our whole planet.

To get round this limitation, the Event Horizon Telescope (EHT) program uses simultaneous signals collected from a network of eight powerful telescopes stationed around the Earth. However, the result, naturally, is a sparse grid of signals rather than a complete image. The rotation of the earth means that with repeat measurements this grid gets filled-out a little. But still, there’s a lot of blank space that needs to be filled-in to complete the image. So, how is that done?

In principle, the idea is simple enough. This video was made some years ago by Katie Bouman, who’s now got worldwide fame for leading the EHT program to produce the black hole image:

The point of the video is that to recognise the song, you don’t need the whole keyboard to be functioning. You just need a few of the keys to be working – and they don’t even have to be 100% precise – to be able to identify the whole song. I have to admit that the efficacy of this video was offset for me by the fact that I got the song wrong, but in the YouTube description of the video, Katie explains this is a common mistake, and uses the point to illustrate that with insufficient data you might get the wrong answer. (I got the wrong answer with complete data though!)

In the case of the music video, it’s our brain that fills in the gaps to give us the whole tune. In the case of the black hole data, it’s sophisticated and clever picture imaging techniques, that rely on the known physics of light transmission and a library of the patterns found in images of many different types. From this combination of physics and library of image templates, it’s possible to extrapolate from the observed data to build proposal images, and for each one find a score of how plausible that image is. The final image is then the one that has the greatest plausibility score. Engineers call this image reconstruction; but the algorithm is fundamentally statistical.

At least, that’s how I understood things. But here’s Katie again giving a much  better explanation in a Ted talk:

Ok, so much for black holes. Now, think of:

  1. Telescopes as football matches;
  2. Image data as match results;
  3. The black hole as a picture that contains information about how good football teams really are;
  4. Astrophysics as the rules by which football matches are played;
  5. The templates that describe how an image changes from one pixel to the next as a rule for saying how team performances might change from one game to the next.

And you can maybe see that in a very general sense, the problem of reconstructing an image of a black hole has the same elements as that of estimating the abilities of football teams. Admittedly, our football models are rather less sophisticated, and we don’t need to wait for the end of the Antarctic winter to ship half a tonne of hard drives containing data back to the lab for processing. But the principles of Statistics are generally the same in all applications, from black hole imaging to sports modelling, and everything in between.

Famous statisticians: Sir Francis Galton



This is the second in a so-far very short series on famous statisticians from history. You may remember that the first in the series was on John Tukey. As I said at that time, rather than just include statisticians randomly in this series, I’m going to focus on those who have had an impact beyond the realm of just statistics.

With that in mind, this post is about Sir Francis Galton (1822-1911), an English statistician who did most of his work in the second half of the 19th century, around the time that Statistics was being born as a viable scientific discipline.

You may remember seeing Galton’s name recently. In a recent post on the bean machine, I mentioned that the device also goes under the name of ‘Galton board’. This is because Galton was the inventor of the machine, which he used to illustrate the Central Limit Theorem, as discussed in the earlier post. You may also remember an earlier post in which I discussed `regression to the mean’; Galton was also the first person to explore and describe this phenomenon, as well as the more general concept of correlation to describe the extent to which two random phenomena are connected.

It’s probably no coincidence that Galton was a half-cousin of Charles Darwin, since much of Galton’s pioneering work was on the way statistics could be used to understand genetic inheritance and human evolution. Indeed, he is the inventor of the term eugenics, which he coined during his attempts to understand the extent to which intelligence is inherited, rather than developed.

Galton is described in Wikipedia as:

  • A statistician
  • A progressive
  • A polymath
  • A sociologist
  • A psychologist
  • An anthropologist
  • A eugenicist
  • A tropical explorer
  • A geographer
  • An inventor
  • A meteorologist
  • A proto-geneticist
  • A psychometrician

And you thought you were busy. Anyway, it’s fair to say that Galton falls in my category of statisticians who have done something interesting with their lives outside of Statistics.

His various contributions apart from those mentioned above include:

  1. He invented the use of weather maps for popular use;
  2. He wrote a book ‘The Art of Travel’ which offered practical travel advice to Victorians;
  3. He was the first to propose the use of questionnaires as a means of data collection;
  4. He conceived the notion of standard deviation as a way of summarising the variation in data;
  5. He devised a technique called composite portraiture which was an early version of photoshop for making montages of photographic portraits;
  6. He pretty much invented the technique of fingerprinting for identifying  individuals by their fingerprints.

In summary, many of the things Galton worked on or invented are still relevant today. And this is just as true for his non-statistical contributions, as for his statistical ones. Of course, it’s an unfortunate historical footnote that his theory of eugenics – social engineering to improve biological characteristics in populations – was adopted and pushed to extremes in Nazi Germany, with unthinkable consequences.

In retrospect, it’s a pity he didn’t just stop once he’d invented the bean machine.


The gene genie

One of the most remarkable advances in scientific understanding over the last couple of hundred years has been Mendelian genetics. This theory explains the basics of genetic inheritance, and is named after its discoverer, Gregor Mendel, who developed the model based on observations of the characteristics of peas when cross-bred from different varieties. In his most celebrated experiment, he crossed pure yellow with pure green peas, and obtained a generation consisting of only yellow peas. But in the subsequent generation, when these peas were crossed, he obtained a mixed generation of yellow and green peas. Mendel constructed the theory of genes and alleles to explain this phenomenon, which subsequently became the basis of modern genetic science.

You probably know all this anyway, but if you’re interested and need a quick reminder, here’s a short video giving an outline of the theory.

Mendel’s pea experiment was very simple, but from the model he developed he was able to calculate the proportion of peas of different varieties to be expected in subsequent generations. For example, in the situation described above, the theory suggests that there would be no green peas in the first generation, but around 1/4 of the peas in the second generation would be expected to be green.

Mendel’s theory extends to more complex situations; in particular it allows for the inheritance of multiple characteristics. In the video, for example, the characteristic for peas to be yellow/green is supplemented by their propensity to be round/wrinkled. Mendel’s model leads to predictions of the proportion of peas in each generation when stratified  by both these characteristics: round and green, or yellow and wrinkled etc etc.

The interesting thing from a statistical point of view is the way Mendel verified his theory. All scientific theories go through the same validation process: first there are some observations; second those observations lead to a theory; and third there is a detailed scrutiny of further observations to ensure that they are consistent with the theory. If they are, then the theory stands, at least until there are subsequent observations which violate the theory, or a better theory is developed to replace the original.

Now, where there is randomness in the observations, the procedure of ensuring that the observations are in agreement with the theory is more complicated. For example, consider the second generation of peas in the experiment above. The theory suggests that, on average, 1/4 of the peas should be green. So if we take 100 peas from the second generation, we’d expect around 25 of them to be green. But that’s different from saying exactly 25 should be green. Is it consistent with the theory if we get 30 green peas? Or 40? At what point do we decide that the experimental evidence is inconsistent with the theory? This is the substance of Statistics.

Actually, the theory of Mendelian inheritance can be expressed entirely in terms of statistical models. There is a specific probability that certain characteristics are passed on from parents to offspring, and this leads to expected proportions of different types in subsequent generations. And expressed this way, we don’t just learn that 1/4 of second generation peas should be green, but also the probability that in a sample of 100 we get 30, 40 or any number of green peas.

And this leads to something extremely interesting: Mendel’s experimental results are simply too good to be true. For example – though I’m actually making the numbers up here – in repeats of the simple pea experiment he almost always got something very close to 25 green peas out of 100. As explained above, the statistics behind Mendelian inheritance do indeed say that he should have got an average of 25 per population. But the same theory also implies that 20 or 35 green peas out of 100 are entirely plausible, and indeed a spread of experimental results between 20 and 35 is to be expected. But, each of Mendel’s experiments gave a number very close to 25. Ironically, if these really were the experimental results, they would be in violation of the theory, which expects not just an average of 25, but with an appropriate amount of variation around that figure.

So, Mendel’s experimental results were actually a primitive example of fake news. But here’s the thing: Mendel’s theory has subsequently been shown to be correct, even if it seems likely that the evidence he presented had been manipulated to strengthen its case. In modern parlance, Mendel focused on making sure his results supported the predicted average, but failed to appreciate that the theory also implied something about the variation in observations. So even if the experimental results were fake news, the theory itself has been shown to be anything but fake.

To be honest, there is some academic debate about whether Mendel cheated or not. As far as I can tell though, this is largely based on the assumption that since he was also a monk and a highly-regarded scientist, cheating would have been out of character. Nobody really denies the fact that the statistics really are simply too good to be true. Of course, in the end, it really is all academic, as the theory has been proven to be correct and is the basis for modern genetic theory. If interested, you can follow the story a little further here.

Incidentally, the fact that statistical models speak about variation as well as about averages is essential to the way they get used in sports modelling. In football, for example, models are generally estimated on the basis of the average number of goals a team is expected to score. But the prediction of match scores as a potential betting aid requires information about the variation in the number of goals around the average value. And though Mendel seems not to have appreciated the point, a statistical model contains information on both averages and variation, and if a model is to be suitable for data, the data will need to be consistent with the model in terms of both aspects.


It’s not based on facts

We think that this is the most extreme version and it’s not based on facts. It’s not data-driven. We’d like to see something that is more data-driven.

Wow! Who is this staunch defender of statistical methodology? This guardian of scientific method. This warrior of the value of empirical information to help identify and confirm a truth.

Ah, but wait a minute, here’s the rest of the quote…

It’s based on modelling, which is extremely hard to do when you’re talking about the climate. Again, our focus is on making sure we have the safest, cleanest air and water.

Any ideas now?

Since it requires an expert in doublespeak to connect those two quotes together, you might be thinking Donald Trump, but we’ll get to him in a minute. No, this was White House spokesperson Sarah Sanders in response to the US government’s own assessment of climate change impact. Here’s just one of the headlines in that report (under the Infrastructure heading):

Our Nation’s aging and deteriorating infrastructure is further stressed by increases in heavy precipitation events, coastal flooding, heat, wildfires, and other extreme events, as well as changes to average precipitation and temperature. Without adaptation, climate change will continue to degrade infrastructure performance over the rest of the century, with the potential for cascading impacts that threaten our economy, national security, essential services, and health and well-being.

I’m sure I don’t need to convince you of the overwhelming statistical and scientific evidence of climate change. But for argument’s sake, let me place here again a graph that I included in a previous post

This is about as data-driven as you can get. Data have been carefully sourced and appropriately combined from locations all across the globe. Confidence intervals have been added – these are the vertical black bars – which account for the fact that we’re estimating a global average on the basis of a limited number of data. But you’ll notice that the confidence bars are smaller for more recent years, since more data of greater reliability is available. So it’s not just data, it’s also careful analysis of data that takes into account that we are estimating something. And it plainly shows that, even after allowance for errors due to data limitation, and also allowance for year-to-year random variation, there has been an upward trend for at least the last 100 years,  which is even more pronounced in the last 40 years.

Now, by the way, here’s a summary of the mean annual total of CO2 that’s been released into the atmosphere over roughly the same time period.

Notice any similarities between these two graphs?

Now, as you might remember from my post on Simpson’s Paradox, correlations are not necessarily evidence of causation. It could be, just on the strength of these two graphs, that both CO2 emission and global mean temperature are being affected by some other process, which is causing them both to change in a similar way. But, here’s the thing: there is a proven scientific mechanism by which an increase in CO2 can cause an increase in atmospheric temperature. It’s basically the greenhouse effect: CO2 particles cause heat to be retained in the atmosphere, rather than reflected back into space, as would be the case if those particles weren’t there. So:

  1. The graphs show a clear correlation between C02 levels and mean temperature levels;
  2. CO2 levels in the atmosphere are rising and bound to rise further under current energy polices worldwide;
  3. There is a scientific mechanism by which increased CO2 emissions lead to an increase in mean global temperature.

Put those three things together and you have an incontrovertible case that climate change is happening, that it’s at least partly driven by human activity and that the key to limiting the damaging effects of such change is to introduce energy policies that drastically reduce C02 emissions.

All pretty straightforward, right?

Well, this is the response to his own government’s report by the President of the United States:

In summary:

I don’t believe it

And the evidence for that disbelief:

One of the problems that a lot of people like myself — we have very high levels of intelligence, but we’re not necessarily such believers.

If only the President of the United States was just a little less intelligent. And if only his White House spokesperson wasn’t such an out-and-out liar.


Cricket statistics versus quantum mechanics


My intention in this blog is to keep the technical level easy enough for everyone to understand. But just in case you struggle to understand something, don’t worry, you’re not alone…

Arguably, one of the most successful applications of statistics to sport in recent years has been the invention of the Duckworth-Lewis method for cricket. As you probably know, for one-day cricket matches each team gets to bowl a fixed number of balls, from which the other team has to make as many runs as possible before they run out of players or balls. The team scoring the most runs wins.

But a difficulty arises when rain interrupts play, and forces one (or both) of the teams to receive a reduced number of balls. Suppose, for example, the first team scored 130 runs from 120 balls. The second team then has to score 130 runs from their allocated 120 balls to draw, or 131 to win. But suppose it rains before the second team starts, and there will only be time for them to receive 60 balls. What is a reasonable target for them to have to reach to win? A first guess might be 65 runs, but that doesn’t take account of the fact that they have 11 batsmen, and can therefore take higher risks when batting than the first team had to. This type of scenario, and other more complicated ones where, for example, both teams have reduced numbers of balls was examined by a pair of statisticians, Frank Duckworth and Tony Lewis, who developed the so-called  Duckworth Lewis method for determining targets in rain-reduced matches. This method, or a variant of it, is now standard in all national and international one-day cricket competitions.

The heart of the method is the use of statistical techniques to quantify the amount of resources a team has available at any stage in their innings. There are two contributions to the resources: the number of balls to be bowled and the number of batsmen who have not already been dismissed. The trick is to combine these contributions in a way that gives a fair measure of overall resources. Once this is done a fair target for the team batting second can be derived, even if the number of balls they will face is reduced due to rain.

That’s as far as I’m going to discuss the method here (though I may return to it in a future post). The point I want to make now is that although statistical ideas are often simple and intuitive in conception, they often seem bafflingly complex in their final form.

Professor Brian Cox  is one of the country’s most eminent astrophysicists. He’s also a fantastic communicator, and has been involved in many tv productions helping explain difficult scientific ideas to a wide public audience. Here he is explaining quantum mechanics in 60 seconds…

And here he is trying (and apparently failing) to make sense of the Duckworth-Lewis method:

So, if ever you struggle to understand something in statistics, you’re in good company. But strip away all of the mathematical gobbledygook and the basic idea is likely to be actually very simple. (It’s not quantum physics).


So sad about the leopards

At the recent offsite, Nity.Raj@smartodds.co.uk suggested I do a post on the statistics of climate change. I will do that properly at some point, but there’s such an enormous amount of material to choose from, that I don’t really know where to start or how best to turn it into the “snappy and informative, but fun and light-hearted” type of post that you’ve come to expect from Smartodds loves Statistics.

So, in the meantime, I’ll just drop the following cartoon, made by First Dog on the Moon, who has a regular series in the Guardian. It’s not exactly about climate science, but similar in that it points at humanity’s failures to face up to and confront the effects we are having on our planet, despite the overwhelming statistical and scientific evidence of both the effect and its consequences. It specifically refers to the recent WWF report which confirms, amongst other things, that humanity has wiped out 60% of the world’s animal population since 1970.

Responding to the report, the Guardian quotes Prof Johan Rockström, a global sustainability expert at the Potsdam Institute for Climate Impact Research in Germany, as follows:

We are rapidly running out of time. Only by addressing both ecosystems and climate do we stand a chance of safeguarding a stable planet for humanity’s future on Earth.

Remember, kids: “Listen to the scientists and not the Nazis”.