You looking at me?

Statistics: helping you solve life’s more difficult problems…

You might have read recently – since it was in every news outlet here, here, here, here, here, here, and here for example – that research has shown that staring at seagulls deters them from stealing your food. This article even shows a couple of videos of how the experiment was conducted. The researcher placed a package of food a few metres in front of her in the vicinity of a seagull. In one experiment she watched the bird and timed how long it took to snatch the food. She then repeated the experiment with the same seagull, but this time facing away from it. Finally, she repeated this exercise with a number of different seagulls in different locations.

At the heart of the study is a statistical analysis, and there are several points about both the analysis itself and the way it was reported that are interesting from a wider statistical perspective:

  1. The experiment is a good example of a designed paired experiment. Some seagulls are more likely to take food than others, regardless of whether they are being looked at or not. The experiment aims to control for this effect by using a pair of results from each seagull: one in which the seagull was stared at, the other in which it was not. By exploiting the fact that the data come in pairs, the accuracy of the analysis is improved considerably, making it much more likely that a genuine effect can be identified within the noisy data.
  2. To avoid the possibility that, for example, a seagull is more likely to take food quickly the second time, the order in which the pairs of experiments are applied is randomised for each seagull.
  3. Other factors are also controlled for in the analysis: the presence of other birds, the distance of the food, the presence of other people and so on.
  4. The original experiment involved 74 birds, but many were uncooperative and refused the food in one or other of the experiments. In the end the analysis is based on just 19 birds that took food both when being stared at and when not. So even though the results prove to be significant, it's worth remembering that the sample on which they are based is very small.
  5. It used to be very difficult to verify the accuracy of a published statistical analysis. These days it’s almost standard for data and code to be published alongside the manuscript itself. This enables readers to both check the results and carry out their own alternative analyses. For this paper, which you can find in full here, the data and code are available here.
  6. If you look at the code it's just a few lines of R. It's notable that such a sophisticated analysis can be carried out with such simple code. (A sketch of the same kind of paired analysis is given below.)
  7. At the risk of being pedantic, although most newspapers went with headlines like ‘Staring at seagulls is best way to stop them stealing your chips’, that’s not really an accurate summary of the research at all. Clearly, a much better way to stop seagulls eating your food is not to eat in the vicinity of seagulls. (Doh!) But even aside from this nit-picking point, the research didn’t show that staring at seagulls stopped them ‘stealing your chips’. It showed that, on average, the seagulls that do bother to steal your chips do so more quickly when you are looking away. In other words, the headline should be:

If you insist on eating chips in the vicinity of seagulls, you’ll lose them quicker if you’re not looking at them

Guess that’s why I’m a statistician and not a journalist.
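For anyone curious about what such a paired analysis might look like, here’s a minimal sketch in R. It is not the authors’ actual code or data: the latencies (in seconds) below, and the variable names, are entirely made up, purely to show the shape of the calculation.

# Hypothetical latencies (seconds to approach the food) for ten gulls,
# each observed once while being stared at and once while not.
latency_staring <- c(21, 35, 13, 48, 27, 60, 33, 19, 41, 25)
latency_looking_away <- c(12, 20, 15, 30, 14, 38, 21, 10, 22, 16)

# Paired analysis: each gull acts as its own control
t.test(latency_staring, latency_looking_away, paired = TRUE)

# Or, avoiding the assumption that the differences are normally distributed:
wilcox.test(latency_staring, latency_looking_away, paired = TRUE)

The pairing is what does the work here: it’s the within-gull differences that get analysed, so gull-to-gull variation in boldness is automatically controlled for.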

The issue of designed statistical experiments was something I also discussed in an earlier post. As I mentioned then, it’s an aspect of Statistics that, so far, hasn’t much been exploited in the context of sports modelling, where analyses tend to be based on historically collected data. But in the context of gambling, where different strategies for betting might be compared and contrasted, it’s likely to be a powerful approach. In that case, the issues of controlling for other variables – like the identity of the gambler or the stake size – and randomising to avoid biases will be equally important.

 

Data controversies

 

Some time ago I wrote about Mendel’s law of genetic inheritance, and how statistical analysis of Mendel’s data suggested his results were too good to be true. It’s not that his theory is wrong; it’s just that the data he provided as evidence appear to have been manipulated to make the case seem incontrovertible. The data lack the variation that Mendel’s own law implies should occur in measurements of that type, leading to the charge that they had been manufactured or massaged in some way.

Well, there’s a similar controversy about the picture at the top of this page.

The photograph, taken 100 years ago, was as striking at the time as the recent picture of a black hole, discussed in an earlier post, is today. However, this picture was taken with basic photographic equipment and a telescopic lens, and shows a total solar eclipse, as the moon passes directly between the Earth and the Sun.

A full story of the controversy is given here.

In summary: Einstein’s theory of general relativity describes gravity not as a force between two attracting masses – as is central to Newtonian physics – but as a curvature caused in space-time due to the presence of massive objects. All objects cause such curvature, but only those that are especially massive, such as stars and planets, will have much of an effect.

Einstein’s relativity model was completely revolutionary compared to the prevailing view of physical laws at the time. But although it explained various astronomical observations that were anomalous under Newtonian laws, it had never been used to predict anomalous behaviour that was then confirmed by observation. The picture above, and similar ones taken at around the same time, changed all that.

In essence, the moon’s blocking of the sun’s rays enabled dimmer and more distant stars to be accurately photographed. Moreover, if Einstein’s theory were correct, the photographed positions of these stars should be slightly shifted because of the spacetime curvature caused by the sun. But the effect is very slight, and even Newtonian physics predicts some deflection due to gravity, albeit only around half as much as Einstein’s theory.

In an attempt to get photographic evidence at the necessary resolution, the British astronomer Arthur Eddington set up two teams of scientists – one on the African island of Príncipe, the other in Sobral, Brazil – to take photographs of the solar eclipse on 29 May, 1919. Astronomical and photographic equipment was much more primitive in those days, so this was no mean feat.

Anyway, to cut a long story short, a combination of poor weather conditions and other setbacks meant that the results were less reliable than had been hoped. It seems that the data collected at Príncipe, where Eddington himself was stationed, were inconclusive, falling somewhere between the Newtonian and Einsteinian predictions. The data at Sobral were taken with two different types of telescope, with one set favouring the Newtonian view and the other Einstein’s. Eddington essentially combined the Einstein-favouring data from Sobral with those from Príncipe and concluded that the evidence supported Einstein’s relativistic model of the universe.

Now, in hindsight, with vast amounts of empirical evidence of many types, we know Einstein’s model to be fundamentally correct. But did Eddington selectively choose his data to support Einstein’s model?

There are different points of view, which hinge on Eddington’s motivation for dropping a subset of the Sobral data from his analysis. One point of view is that he wanted Einstein’s theory to be correct, and therefore simply ignored the data that were less favourable. This argument is fuelled by political reasoning: it holds that since Eddington was a Quaker, and therefore a pacifist, he wanted to support a German theory as a kind of post-war reconciliation.

The alternative point of view, for which there is some documentary evidence, is that the Sobral data which Eddington ignored had been independently designated as unreliable. Therefore, on proper scientific grounds, Eddington behaved entirely correctly by excluding them from his analysis, and his subsequent conclusions favouring the Einstein model were entirely consistent with the scientific data and information he had available.

This issue will probably never be fully resolved, though in a recent review of several books on the matter, the theoretical physicist Peter Coles (no relation) claims to have reanalysed the data given in the Eddington paper using modern statistical methods, and to have found no reason to doubt Eddington’s integrity. I have no reason to doubt that point of view, though no detail is given of the statistical analysis that was carried out.

What’s interesting though, from a statistical point of view, is how the interpretation of the results depends on the reason for the exclusion of a subset of the Sobral data. If your view is that Eddington knew their contents and excluded them on that basis, then his conclusions in favour of Einstein must be regarded as biased. If you accept that Eddington excluded these data a priori because of their unreliability, then his conclusions were fair and accurate.

Data are often treated as a neutral aspect of an analysis. But as this story illustrates, the choice of which data to include or exclude, and the reasons for doing so, may be factors which fundamentally alter the direction an analysis will take, and the conclusions it will reach.

 

 

 

1 in 562

Ever heard of the Fermi paradox? It’s named after the Italian physicist Enrico Fermi, and concerns the fact that though we’ve found no empirical evidence of extraterrestrial life, standard calculations based on our current knowledge of the universe suggest that the probability of life elsewhere in our galaxy is very high. The theoretical side of the paradox is usually based on some variation of the Drake equation, which takes various known or estimated constants – like the number of observed stars in our galaxy, the estimated average number of planets per star, the proportion of these that are likely to be able to support life, and so on – and feeds them into an equation which calculates the expected number of alien civilisations in our galaxy.

Though there’s a lot of uncertainty about the numbers that feed into Drake’s equation, best estimates lead to an answer that suggests there should be millions of civilisations out there somewhere. And Fermi’s paradox points to the contrast between this number and the zero civilisations that we’ve actually observed.
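Just to make the structure of the calculation concrete, here’s a minimal sketch in R of a Drake-style calculation. Both the simplified form of the equation and all the numbers fed into it are purely illustrative; they are not the estimates used in the video or in any published analysis.

# A simplified Drake-style calculation: multiply a chain of estimates together.
drake <- function(n_stars,        # stars in our galaxy
                  f_planets,      # fraction of stars with planets
                  n_habitable,    # average habitable planets per such star
                  f_life,         # fraction of habitable planets developing life
                  f_intelligent,  # fraction of those developing intelligence
                  f_communicate,  # fraction of those able to communicate
                  f_now) {        # fraction whose civilisation overlaps with ours
  n_stars * f_planets * n_habitable * f_life * f_intelligent * f_communicate * f_now
}

# Even with these deliberately pessimistic, made-up inputs the answer isn't zero:
drake(n_stars = 1e11, f_planets = 0.5, n_habitable = 0.2,
      f_life = 0.1, f_intelligent = 0.01, f_communicate = 0.1, f_now = 1e-4)
# roughly 100 civilisations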

Anyway, rather than try to go through any of this in greater detail, I thought I’d let this video do the explaining. And for fun, they suggest using the same technique to calculate the chances of you finding someone you are compatible with as a love partner.

Now, you probably don’t need me to explain all the limitations in this methodology, either for the evidence of alien life or for potential love partners with whom you are compatible. Though of course, the application to finding love partners is just for fun, right?

Well, yes and no. Here’s Rachel Riley of Countdown fame doing a barely disguised piece of publicity for eHarmony.

She uses pretty much the same methodology to show that you have…

… a 1 in 562 chance of finding love.

Rachel also gives some advice to help you improve those odds. First up:

… get to know your colleagues

<Smartodds!!! I know!!!>

But it’s maybe not as bad as it sounds; she’s suggesting your colleagues might have suitable friends for you to pair up with, rather than your colleagues being potential love-partners themselves.

Finally, I’ll let you think about whether the methodology and assumptions used in Rachel’s calculations make sense or not. And maybe even try to understand what the 1 in 562 answer actually means, especially as a much higher proportion of people actually do end up in relationships. The opposite of Fermi’s paradox!

By coincidence


In an earlier post I suggested we play a game. You’d pick a sequence of three outcomes of a coin toss, like THT. Then I’d pick a different triplet, say TTT. I’d then toss a coin repeatedly and whoever’s chosen triplet showed up first in the sequence would be the winner.

In the post I gave the following example…

H T H H H T T H T H …..

… and with that outcome and the choices above you’d have won since your selection of THT shows up starting on the 7th coin toss, without my selection of TTT showing up before.

The question I asked was who this game favoured. Assuming we both play as well as possible, does it favour

  1. neither of us, because we both get to choose and the game is symmetric? Or;
  2. you, because you get to choose first and have the opportunity to do so optimally? Or;
  3. me, because I get to see your choice before I have to make mine?

The answer turns out to be 3. I have a big advantage over you, if I play smart. We’ll discuss what that means in a moment.

But in terms of these possible answers, it couldn’t have been 2. Whatever you choose I could have chosen the exact opposite and by symmetry, since H and T are equally likely, our two triplets would have been equally likely to occur first in the sequence. So, if you choose TTT, I choose HHH. If you choose HHT, I choose TTH and so on. In this way I don’t have an advantage over you, but neither do you over me. So we can rule out 2 as the possible answer.

But I can actually play better than that and have an advantage over you, whatever choice you make. I play as follows:

  1. My first choice in the sequence is the opposite of your second choice.
  2. My second and third choices are equal to your first and second choices.

So, if you chose TTT, I would choose HTT. If you chose THT, I would choose TTH. And so on. It’s not immediately obvious why this should give me an advantage, but it does. And it does so for every choice you can make.

The complete set of selections you can make, the selections I will make in response, and the corresponding odds in my favour are given in the following table.

Your Choice My Choice My win odds
HHH THH 7:1
HHT THH 3:1
HTH HHT 2:1
HTT HHT 2:1
THH TTH 2:1
THT TTH 2:1
TTH HTT 3:1
TTT HTT 7:1

As you can see, your best choice is to go for any of HTH, HTT, THH, THT, but even then the odds are 2:1 in my favour. That’s to say, I’m twice as likely to win as you in those circumstances. My odds increase to 3:1 – I’ll win three times as often as you – if you choose HHT or TTH; and my odds are a massive 7:1 – I’ll win seven times as often as you – if you choose HHH or TTT.

So, even if you play optimally, I’ll win twice as often as you. But why should this be so? The probabilities aren’t difficult to calculate, but most are a little more complicated than I can reasonably include here. Let’s take the easiest example though. Suppose you choose HHH, in which case I choose THH. I then start tossing the coins. It’s possible that HHH will be the first 3 coins in the sequence. That will happen with probability 1/2 x 1/2 x 1/2 = 1/8. But if that doesn’t happen, then there’s no way you can win. Because the first time HHH appears in the sequence it must have been preceded by a T (otherwise HHH would have occurred earlier), in which case my THH occurs before your HHH. So, you would have won with probability 1/8, I win with probability 7/8, and my odds of winning are 7:1.

Like I say, the other cases – except when you choose TTT, which is identical to HHH, modulo switching H’s and T’s – are a little more complicated, but the principle is essentially the same every time.
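If you want to convince yourself of these odds without doing the probability calculations, a quick Monte Carlo simulation will do it. The following is just my own sketch in R, not code taken from anywhere else; it encodes the response rule described above and estimates my win probability against any choice of yours.

# Does the triplet 'mine' appear before the triplet 'yours' in a run of fair coin tosses?
i_win_once <- function(mine, yours) {
  window <- sample(c("H", "T"), 3, replace = TRUE)
  repeat {
    current <- paste(window, collapse = "")
    if (current == mine)  return(TRUE)
    if (current == yours) return(FALSE)
    # slide the window along by one toss
    window <- c(window[-1], sample(c("H", "T"), 1))
  }
}

# My response rule: opposite of your second choice, then your first and second choices
my_response <- function(yours) {
  s <- strsplit(yours, "")[[1]]
  paste0(ifelse(s[2] == "H", "T", "H"), s[1], s[2])
}

# Estimate my win probability against, say, your choice of HHH
set.seed(1)
mean(replicate(10000, i_win_once(my_response("HHH"), "HHH")))
# should come out close to 7/8, i.e. odds of 7:1 in my favour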

By coincidence, this game was invented by Walter Penney (Penney -> penny, geddit?), who published it in the Journal of Recreational Mathematics in 1969. It’s interesting from a mathematical/statistical/game-theory point of view because it’s an example of a non-transitive game. For example, looking at the table above, HHT is inferior to THH; THH is inferior to TTH; TTH is inferior to HTT; and HTT is inferior to HHT. Which brings us full circle. So, there’s no overall optimal selection. Each can be beaten by another, which in turn can be beaten by a different choice again. This is why the second player has an advantage: they can always find a selection that will beat the first player’s. It doesn’t matter that their choice can also be beaten, because the first player has already made their selection and it wasn’t that one.

The best known example of a non-transitive game is Rock-Paper-Scissors. Rock beats scissors; scissors beats paper; paper beats rock. But in that case it’s completely deterministic – rock will always beat scissors, for example. In the Penney coin tossing game, HTT will usually beat TTT, but occasionally it won’t. So, it’s perhaps better described as a random non-transitive game.

The game also has connections with genetics. Strands of DNA are long chains composed of sequences of individual molecules known as nucleotides. Only four different types of nucleotide are found in DNA strands, and these are usually labelled A, T, G and C. The precise ordering of these nucleotides in the DNA strand effectively defines a code that determines the characteristics of the individual having that DNA.

It’s not too much of a stretch of the imagination to see that a long sequence of nucleotides A, T, G and C is not so very different – albeit with 4 variants, rather than 2 – from a sequence of outcomes just like those from the tosses of a coin. Knowing which combinations of the nucleotides are more likely to occur than others, and other combinatorial aspects of their arrangements, proves to be an important contribution of Statistics to the understanding of genetics and the development of genetic intervention therapies. Admittedly, it’s not a direct application of Penney’s game, but the statistical objectives and techniques required are not so very different.


Thanks to those of you who wrote to me with solutions to this problem, all of which were at least partially correct.

Woodland creatures

The hedgehog and the fox is an essay by the philosopher Isaiah Berlin. Though published in 1953, its title is a reference to a fragment of a poem by the ancient Greek poet Archilochus. The relevant passage translates as:

… a fox knows many things, but a hedgehog one important thing.

Isaiah Berlin used this concept to classify famous thinkers: those whose ideas could be summarised by a single principle are hedgehogs; those whose ideas are more pragmatic, multi-faceted and evolving are foxes.

This dichotomy of approaches to thinking has more recently been applied in the context of prediction, and is the basis of the following short (less than 5-minute) video, kindly suggested to me by Richard.Greene@Smartodds.co.uk.

Watch and enjoy…

So, remarkably, in a study of the accuracy of individuals when making predictions, almost nothing made a difference: age, sex, political outlook… except that ‘foxes’ are better predictors than ‘hedgehogs’. Being well-versed in a single consistent philosophy turns out to be inferior to an adaptive and evolving approach to knowledge and its application.

The narrator, David Spiegelhalter, also summarises the strengths of a good forecaster as:

  1. Aggregation. They use multiple sources of information, are open to new knowledge and are happy to work in teams.
  2. Metacognition. They have an insight into how they think and the biases they might have, such as seeking evidence that simply confirms pre-set ideas.
  3. Humility. They have a willingness to acknowledge uncertainty, admit errors and change their minds. Rather than saying categorically what is going to happen, they are only prepared to give probabilities of future events.

(Could almost be a bible for a sports modelling company.)

These principles are taken from the book Future Babble by Dan Gardner, which looks like it’s a great read. The tagline for the book is ‘how to stop worrying and love the unpredictable’, which on its own is worth the cost of the book.


Incidentally, I could just as easily have written a blog entry with David Spiegelhalter as part of my series of famous statisticians. Until recently he was the president of the Royal Statistical Society. He was also knighted in 2014 for his services to Statistics, and has numerous awards and honorary degrees.

His contributions to statistics are many, especially in the field of Medical Statistics. Equally though, as you can tell from the above video, he is a fantastic communicator of statistical ideas. He also has a recent book out: The Art of Statistics: Learning from Data. I’d guess that if anyone wants to learn something about Statistics from a single book, this would be the place to go. I’ve just bought it, but haven’t read it yet. Once I do, if it seems appropriate, I’ll post a review to the blog.

Freddy’s story: part 1

This is a great story with a puzzle and an apparent contradiction at the heart of it, that you might like to think about yourself.

A couple of weeks ago Freddy.Teuma@smartodds.co.uk wrote to me to say that he’d been looking at the recent post which discussed a probability puzzle based on coin tossing, and had come across something similar that he thought might be useful for the blog. Actually, the problem Freddy described was based on an algorithm for optimisation using genetic mutation techniques that a friend had contacted him about.

To solve the problem, Freddy did four smart things:

  1. He first simplified the problem to make it easier to tackle, while still maintaining its core elements;
  2. He used intuition to predict what the solution would be;
  3. He supported his intuition with mathematical formalism;
  4. He did some simulations to verify that his intuition and mathematical reasoning were correct.

This is exactly how a statistician would approach both this problem and problems of greater complexity.

However… the pattern of results Freddy observed in the simulations contradicted what his intuition and mathematics had suggested would happen, and so he adjusted his beliefs accordingly. And then he wrote to me.

This is the version of the problem that Freddy had simplified from the original…

Suppose you start with a certain amount of money. For argument’s sake, let’s say it’s £100. You then play several rounds of a game. At each round the rules are as follows:

  1. You toss a fair coin (Heads and Tails each have probability 1/2).
  2. If the coin shows Heads, you lose a quarter of your current amount of money and end up with 3/4 of what you had at the start of the round.
  3. If the coin shows Tails, you win a quarter of your current amount of money and end up with 5/4 of what you had at the start of the round.

For example, suppose your first 3 tosses of the coin are Heads, Tails, Heads. The money you hold goes from £100, to £75 to £93.75 to £70.3125.

Now, suppose you play this game for a large number of rounds. Again, for argument’s sake, let’s say it’s 1000 rounds. How much money do you expect to have, on average, at the end of these 1000 rounds?

Have a think about this game yourself, and see what your own intuition suggests before scrolling down.

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

Freddy’s reasoning was as follows. In each round of the game I will lose or gain 25% of my current amount of money with equal probability. So, if I currently have £100, then at the end of the next round I will have either £75 or £125 with equal probability. And the average is still £100. This reasoning is true at each round of the game. And so, after any number of rounds, including 1000, I’d expect to have exactly the same amount of money as when I started: £100.

But when Freddy simulated the process, he found a different sort of behaviour. In each of his simulations, the money held after 1000 rounds was very close to zero, suggesting that the average is much smaller than £100.

I’ve taken the liberty of doing some simulations myself: the pattern of results in 16 repeats of the game, each time up to 1000 rounds, is shown in the following figure.

Each panel of the figure corresponds to a repeat of the game, and in each repeat I’ve plotted a red trace showing how much money I hold after each round of the game. In each case you can see that I start with £100, there’s then a bit of oscillation – more in some of the realisations than in others, due to random variation – but in all cases the amount of money I have hits something very close to zero somewhere before 250 rounds and then stays there right up to 1000 rounds.

So, there is indeed a conflict between Freddy’s intuition and the picture that these simulations provide.

What’s going on?

I’ll leave you to think about it for a while, and write with my own explanation and discussion of the problem in a future post. If you’d like to write to me to explain what you think is happening, I’d be very happy to hear from you.

Obviously, I’m especially grateful to Freddy for having sent me the problem in the first place, and for agreeing to let me write a post about it.


Update: if you’d like to run the simulation exercise yourself, just click the ‘run’ button in the following window. This will simulate the game for 1000 rounds, starting with £100. The graph will show you how much money you hold after each round of the game, while if you toggle to the console window it will tell you how much money you have after the 1000th round (to the nearest £0.01). This may not work in all browsers, but seems to work ok in Chrome. You can repeat the experiment simply by clicking ‘Run’ again. You’re likely to get a different graph each time because of the randomness in the simulations. But what about the final amount? Does that also change? And what does it suggest about Freddy’s reasoning that the average amount should stay equal to £100?

game_sim <- function(n_rounds = 1000, money_start = 100){
  require(ggplot2)
  # Track the money held after each round
  money <- c()
  money[1] <- money_start
  # At each round, the money is multiplied by 0.75 or 1.25 with equal probability
  for(i in 2:n_rounds){
    money[i] <- money[i-1] * sample(c(0.75, 1.25), 1)
  }
  m <- data.frame(round = 1:n_rounds, money = money)
  cat('Money in pounds after ', n_rounds, ' rounds is ', round(money[n_rounds], 2))
  # Plot the trajectory of money held against round number
  ggplot(aes(x = round, y = money), data = m) +
    geom_line(color = 'red') +
    ggtitle('Money')
}
game_sim()

A bad weekend

Had a bad weekend? Maybe your team faded against relegated-months-ago Huddersfield Town, consigning your flickering hopes of a Champions League qualification spot to the wastebin. Or maybe you support Arsenal.

Anyway, Smartodds loves Statistics is here to help you put things in perspective: ‘We are in trouble‘. But not trouble in the sense of having to play Europa League qualifiers on a Thursday night. Trouble in the sense that…

Human society is under urgent threat from loss of Earth’s natural life

Yes, deep shit trouble.

This is according to a Global Assessment report by the United Nations, based on work by hundreds of scientists who compiled as many as 15,000 academic studies. Here are some of the headline statistics:

  • Nature is being destroyed at a rate of tens to hundreds of times greater than the average over the last 10 million years;
  • The biomass of wild mammals has fallen by 82%;
  • Natural ecosystems have lost around half of their area;
  • A million species are at risk of extinction;
  • Pollinator loss has put up to £440 billion of crop output at risk;

The report goes on to say:

The knock-on impacts on humankind, including freshwater shortages and climate instability, are already “ominous” and will worsen without drastic remedial action.

But if only we could work out what the cause of all this is. Oh, hang on, the report says it’s…

… all largely as a result of human actions.

For example, actions like these:

  • Land degradation has reduced the productivity of 23% of global land;
  • The extent of wetlands has fallen by 83% since 1700;
  • In the years 2000-2013 the area of intact forest fell by 7% – an area the size of France and the UK combined;
  • More than 80% of wastewater, as well as 300-400m tons of industrial waste, is pumped back into natural water reserves without treatment;
  • Plastic pollution has increased roughly tenfold since 1980, affecting 86% of marine turtles, 44% of seabirds and 43% of marine animals;
  • Fertiliser run-off has created 400 ocean ‘dead zones’, with a combined area around the size of the UK.

You probably don’t need to be a bioscientist, and certainly not a statistician, to realise that none of this is particularly good news. However, the report goes on to list various strategies that agencies, governments and countries need to adopt in order to mitigate the damage that has already been done and minimise the further damage that will unavoidably be done under current regimes. But none of it’s easy, and the evidence so far doesn’t suggest much collective human will to accept the responsibilities involved.

Josef Settele of the Helmholtz Centre for Environmental Research in Germany said

People shouldn’t panic, but they should begin drastic change. Business as usual with small adjustments won’t be enough.

So, yes, cry all you like about Liverpool’s crumbling hopes for a miracle against Barcelona tonight, but keep it in perspective and maybe even contribute to the wider task of saving humanity from itself.

<End of rant. Enjoy tonight’s game.>


Correction: *Barcelona’s* crumbling hopes

Picture this

You can’t help but be amazed at the recent release of the first ever genuine image of a black hole. The picture itself, and the knowledge of what it represents, are extraordinary enough, but the sheer feat of human endeavour that led to this image is equally breathtaking.

Now, as far as I can see from the list of collaborators that are credited with the image, actual designated statisticians didn’t really contribute. But, from what I’ve read about the process of the image’s creation, Statistics is central to the underlying methodology. I don’t understand the details, but the outline is something like this…

Although black holes are extremely big, they’re also a long way away. This one, for example, has a diameter that’s bigger than our entire solar system. But it’s also at the heart of the Messier 87 galaxy, some 55 million light years away from Earth. Which means that, viewed from Earth, it occupies only a tiny patch of the sky. The analogy that’s been given is that capturing the black hole’s image would be equivalent to trying to photograph a piece of fruit on the surface of the moon. And the laws of optics imply this would require a telescope the size of our whole planet.

To get round this limitation, the Event Horizon Telescope (EHT) program uses simultaneous signals collected from a network of eight powerful telescopes stationed around the Earth. The result, naturally, is a sparse grid of signals rather than a complete image. The rotation of the Earth means that with repeat measurements this grid gets filled out a little. But still, there’s a lot of blank space that needs to be filled in to complete the image. So, how is that done?

In principle, the idea is simple enough. This video was made some years ago by Katie Bouman, who’s now got worldwide fame for leading the EHT program to produce the black hole image:

The point of the video is that to recognise the song, you don’t need the whole keyboard to be functioning. You just need a few of the keys to be working – and they don’t even have to be 100% precise – to be able to identify the whole song. I have to admit that the efficacy of this video was offset for me by the fact that I got the song wrong, but in the YouTube description of the video, Katie explains this is a common mistake, and uses the point to illustrate that with insufficient data you might get the wrong answer. (I got the wrong answer with complete data though!)

In the case of the music video, it’s our brain that fills in the gaps to give us the whole tune. In the case of the black hole data, it’s sophisticated image-processing techniques that rely on the known physics of light transmission and a library of the patterns found in images of many different types. From this combination of physics and image templates, it’s possible to extrapolate from the observed data to build proposal images, and to give each one a score of how plausible it is. The final image is then the one with the greatest plausibility score. Engineers call this image reconstruction; but the algorithm is fundamentally statistical.
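To give a flavour of the idea, here’s a toy sketch of my own in R; it has nothing to do with the actual EHT pipeline, but it shows the same principle in one dimension: from a handful of noisy observations, choose the candidate ‘image’ that best balances fit to the data against a penalty for implausible roughness.

# A toy 1-d analogue of image reconstruction from sparse data.
set.seed(42)
n <- 50
truth <- sin(seq(0, 2 * pi, length.out = n))    # the 'true' signal we never see in full
obs_idx <- sort(sample(1:n, 10))                # we only observe 10 of the 50 points...
obs <- truth[obs_idx] + rnorm(10, sd = 0.1)     # ...and with measurement noise

# Plausibility score: misfit to the observed data plus a roughness penalty
score <- function(x, lambda = 5) {
  sum((obs - x[obs_idx])^2) + lambda * sum(diff(x)^2)
}

# The reconstruction is the candidate with the best (lowest) score
fit <- optim(par = rep(0, n), fn = score, method = "BFGS", control = list(maxit = 500))
reconstruction <- fit$par

plot(truth, type = "l", ylab = "signal")        # true signal
points(obs_idx, obs, col = "red")               # sparse, noisy observations
lines(reconstruction, col = "blue")             # reconstructed signal

In the real problem the roughness penalty is replaced by far more sophisticated measures of plausibility learned from libraries of images, but the statistical principle of trading off data-fit against prior plausibility is the same.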

At least, that’s how I understood things. But here’s Katie again giving a much better explanation in a TED talk:

Ok, so much for black holes. Now, think of:

  1. Telescopes as football matches;
  2. Image data as match results;
  3. The black hole as a picture that contains information about how good football teams really are;
  4. Astrophysics as the rules by which football matches are played;
  5. The templates that describe how an image changes from one pixel to the next as a rule for saying how team performances might change from one game to the next.

And you can maybe see that in a very general sense, the problem of reconstructing an image of a black hole has the same elements as that of estimating the abilities of football teams. Admittedly, our football models are rather less sophisticated, and we don’t need to wait for the end of the Antarctic winter to ship half a tonne of hard drives containing data back to the lab for processing. But the principles of Statistics are generally the same in all applications, from black hole imaging to sports modelling, and everything in between.

Famous statisticians: Sir Francis Galton

 

 

This is the second in a so-far very short series on famous statisticians from history. You may remember that the first in the series was on John Tukey. As I said at the time, rather than include statisticians randomly in this series, I’m going to focus on those who have had an impact beyond the realm of Statistics itself.

With that in mind, this post is about Sir Francis Galton (1822-1911), an English statistician who did most of his work in the second half of the 19th century, around the time that Statistics was being born as a viable scientific discipline.

You may remember seeing Galton’s name recently. In a recent post on the bean machine, I mentioned that the device also goes under the name of ‘Galton board’. This is because Galton was the inventor of the machine, which he used to illustrate the Central Limit Theorem, as discussed in that earlier post. You may also remember another earlier post in which I discussed ‘regression to the mean’; Galton was also the first person to explore and describe this phenomenon, as well as the more general concept of correlation, which describes the extent to which two random phenomena are connected.

It’s probably no coincidence that Galton was a half-cousin of Charles Darwin, since much of Galton’s pioneering work was on the way statistics could be used to understand genetic inheritance and human evolution. Indeed, he is the inventor of the term eugenics, which he coined during his attempts to understand the extent to which intelligence is inherited, rather than developed.

Galton is described in Wikipedia as:

  • A statistician
  • A progressive
  • A polymath
  • A sociologist
  • A psychologist
  • An anthropologist
  • A eugenicist
  • A tropical explorer
  • A geographer
  • An inventor
  • A meteorologist
  • A proto-geneticist
  • A psychometrician

And you thought you were busy. Anyway, it’s fair to say that Galton falls in my category of statisticians who have done something interesting with their lives outside of Statistics.

His various contributions apart from those mentioned above include:

  1. He invented the use of weather maps for popular use;
  2. He wrote a book ‘The Art of Travel’ which offered practical travel advice to Victorians;
  3. He was the first to propose the use of questionnaires as a means of data collection;
  4. He conceived the notion of standard deviation as a way of summarising the variation in data;
  5. He devised a technique called composite portraiture which was an early version of photoshop for making montages of photographic portraits;
  6. He pretty much invented the technique of fingerprinting as a means of identifying individuals.

In summary, many of the things Galton worked on or invented are still relevant today. And this is just as true for his non-statistical contributions, as for his statistical ones. Of course, it’s an unfortunate historical footnote that his theory of eugenics – social engineering to improve biological characteristics in populations – was adopted and pushed to extremes in Nazi Germany, with unthinkable consequences.

In retrospect, it’s a pity he didn’t just stop once he’d invented the bean machine.

 

The gene genie

One of the most remarkable advances in scientific understanding over the last couple of hundred years has been Mendelian genetics. This theory explains the basics of genetic inheritance, and is named after its discoverer, Gregor Mendel, who developed the model based on observations of the characteristics of peas when cross-bred from different varieties. In his most celebrated experiment, he crossed pure yellow with pure green peas, and obtained a generation consisting of only yellow peas. But in the subsequent generation, when these peas were crossed, he obtained a mixed generation of yellow and green peas. Mendel constructed the theory of genes and alleles to explain this phenomenon, which subsequently became the basis of modern genetic science.

You probably know all this anyway, but if you’re interested and need a quick reminder, here’s a short video giving an outline of the theory.

Mendel’s pea experiment was very simple, but from the model he developed he was able to calculate the proportion of peas of different varieties to be expected in subsequent generations. For example, in the situation described above, the theory suggests that there would be no green peas in the first generation, but around 1/4 of the peas in the second generation would be expected to be green.

Mendel’s theory extends to more complex situations; in particular, it allows for the inheritance of multiple characteristics. In the video, for example, the characteristic for peas to be yellow/green is supplemented by their propensity to be round/wrinkled. Mendel’s model leads to predictions of the proportion of peas in each generation when stratified by both these characteristics: round and green, yellow and wrinkled, and so on.

The interesting thing from a statistical point of view is the way Mendel verified his theory. All scientific theories go through the same validation process: first there are some observations; second those observations lead to a theory; and third there is a detailed scrutiny of further observations to ensure that they are consistent with the theory. If they are, then the theory stands, at least until there are subsequent observations which violate the theory, or a better theory is developed to replace the original.

Now, where there is randomness in the observations, the procedure of ensuring that the observations are in agreement with the theory is more complicated. For example, consider the second generation of peas in the experiment above. The theory suggests that, on average, 1/4 of the peas should be green. So if we take 100 peas from the second generation, we’d expect around 25 of them to be green. But that’s different from saying exactly 25 should be green. Is it consistent with the theory if we get 30 green peas? Or 40? At what point do we decide that the experimental evidence is inconsistent with the theory? This is the substance of Statistics.

Actually, the theory of Mendelian inheritance can be expressed entirely in terms of statistical models. There is a specific probability that certain characteristics are passed on from parents to offspring, and this leads to expected proportions of different types in subsequent generations. And expressed this way, we don’t just learn that 1/4 of second generation peas should be green, but also the probability that in a sample of 100 we get 30, 40 or any number of green peas.
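For instance, here’s a quick sketch in R of the standard binomial calculation; it’s just an illustration of the point, not an analysis of Mendel’s actual data. Under the model, the number of green peas in a sample of 100 second-generation peas follows a Binomial(100, 1/4) distribution.

n <- 100
p <- 1/4

dbinom(25, size = n, prob = p)    # probability of exactly 25 green peas: about 0.09
dbinom(30, size = n, prob = p)    # probability of exactly 30 green peas: about 0.05

# Central 95% range of plausible counts under the model
qbinom(c(0.025, 0.975), size = n, prob = p)    # roughly 17 to 34 green peas

So even if the model is exactly right, getting precisely 25 green peas is actually quite unlikely in any single experiment, and counts anywhere from the high teens to the mid-thirties are perfectly consistent with the theory.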

And this leads to something extremely interesting: Mendel’s experimental results are simply too good to be true. For example – though I’m actually making the numbers up here – in repeats of the simple pea experiment he almost always got something very close to 25 green peas out of 100. As explained above, the statistics behind Mendelian inheritance do indeed say that he should have got an average of 25 per population of 100. But the same theory also implies that 20 or 35 green peas out of 100 are entirely plausible, and indeed a spread of experimental results between roughly 20 and 35 is to be expected. Yet each of Mendel’s experiments gave a number very close to 25. Ironically, if these really were the experimental results, they would be in violation of the theory, which predicts not just an average of 25, but also an appropriate amount of variation around that figure.

So, Mendel’s experimental results were actually a primitive example of fake news. But here’s the thing: Mendel’s theory has subsequently been shown to be correct, even if it seems likely that the evidence he presented had been manipulated to strengthen its case. In modern parlance, Mendel focused on making sure his results supported the predicted average, but failed to appreciate that the theory also implied something about the variation in observations. So even if the experimental results were fake news, the theory itself has been shown to be anything but fake.

To be honest, there is some academic debate about whether Mendel cheated or not. As far as I can tell though, this is largely based on the assumption that since he was also a monk and a highly-regarded scientist, cheating would have been out of character. Nobody really denies the fact that the statistics really are simply too good to be true. Of course, in the end, it really is all academic, as the theory has been proven to be correct and is the basis for modern genetic theory. If interested, you can follow the story a little further here.


Incidentally, the fact that statistical models speak about variation as well as about averages is essential to the way they get used in sports modelling. In football, for example, models are generally estimated on the basis of the average number of goals a team is expected to score. But the prediction of match scores as a potential betting aid requires information about the variation in the number of goals around the average value. And though Mendel seems not to have appreciated the point, a statistical model contains information on both averages and variation, and if a model is to be suitable for data, the data will need to be consistent with the model in terms of both aspects.
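As a toy illustration of that last point (and it is only a toy sketch, not any model actually used for sports modelling here): if a team is expected to score 1.4 goals in a match, a simple Poisson model describes not just that average but the whole spread of possible goal counts.

lambda <- 1.4                 # expected (average) number of goals
dpois(0:5, lambda)            # probabilities of scoring 0, 1, 2, 3, 4, 5 goals
# roughly 0.25, 0.35, 0.24, 0.11, 0.04, 0.01

It’s this full distribution, not just the average, that a match-score prediction needs; and it’s exactly this aspect of a statistical model that Mendel’s reported data failed to respect.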