1 in 562

Ever heard of the Fermi paradox? It’s named after the Italian physicist Enrico Fermi, and concerns the fact that though we’ve found no empirical evidence of extraterrestrial life, standard calculations based on our current knowledge of the universe suggest that the probability of life elsewhere in our galaxy is very high. The theoretical side of the paradox is usually based on some variation of the Drake equation, which takes various known or estimated constants – like the number of observed stars in our galaxy, the estimated average number of planets per star, the proportion of these that are likely to be able to support life, and so on – and feeds them into an equation which calculates the expected number of alien civilisations in our galaxy.

Though there’s a lot of uncertainty about the numbers that feed into Drake’s equation, best estimates lead to an answer that suggests there should be millions of civilisations out there somewhere. And Fermi’s paradox points to the contrast between this number and the zero civilisations that we’ve actually observed.

Anyway, rather than try to go through any of this in greater detail, I thought I’d let this video do the explaining. And for fun, they suggest using the same technique to calculate the chances of you finding someone you are compatible with as a love partner.

Now, you probably don’t need me to explain all the limitations in this methodology, either for the evidence of alien life or for potential love partners with whom you are compatible. Though of course, the application to finding love partners is just for fun, right?

Well, yes and no. Here’s Rachel Riley of Countdown fame doing a barely-disguised publicity piece for eHarmony.

She uses pretty much the same methodology to show that you have…

… a 1 in 562 chance of finding love.

Rachel also gives some advice to help you improve those odds. First up:

… get to know your colleagues

<Smartodds!!! I know!!!>

But it’s maybe not as bad as it sounds; she’s suggesting your colleagues might have suitable friends for you to pair up with, rather than your colleagues being potential love-partners themselves.

Finally, I’ll let you think about whether the methodology and assumptions used in Rachel’s calculations make sense or not. And maybe even try to understand what the 1 in 562 answer actually means, especially as a much higher proportion of people actually do end up in relationships. The opposite of Fermi’s paradox!

By coincidence


In an earlier post I suggested we play a game. You’d pick a sequence of three outcomes of a coin toss, like THT. Then I’d pick a different triplet, say TTT. I’d then toss a coin repeatedly and whoever’s chosen triplet showed up first in the sequence would be the winner.

In the post I gave the following example…

H T H H H T T H T H …..

… and with that outcome and the choices above you’d have won since your selection of THT shows up starting on the 7th coin toss, without my selection of TTT showing up before.

The question I asked was who this game favoured. Assuming we both play as well as possible, does it favour

  1. neither of us, because we both get to choose and the game is symmetric? Or;
  2. you, because you get to choose first and have the opportunity to do so optimally? Or;
  3. me, because I get to see your choice before I have to make mine?

The answer turns out to be 3. I have a big advantage over you, if I play smart. We’ll discuss what that means in a moment.

But in terms of these possible answers, it couldn’t have been 2. Whatever you choose I could have chosen the exact opposite and by symmetry, since H and T are equally likely, our two triplets would have been equally likely to occur first in the sequence. So, if you choose TTT, I choose HHH. If you choose HHT, I choose TTH and so on. In this way I don’t have an advantage over you, but neither do you over me. So we can rule out 2 as the possible answer.

But I can actually play better than that and have an advantage over you, whatever choice you make. I play as follows:

  1. My first choice in the sequence is the opposite of your second choice.
  2. My second and third choices are equal to your first and second choices.

So, if you chose TTT, I would choose HTT. If you chose THT, I would choose TTH. And so on. It’s not immediately obvious why this should give me an advantage, but it does. And it does so for every choice you can make.

The complete set of selections you can make, the selections I will make in response, and the corresponding odds in my favour are given in the following table.

Your Choice   My Choice   My win odds
HHH           THH         7:1
HHT           THH         3:1
HTH           HHT         2:1
HTT           HHT         2:1
THH           TTH         2:1
THT           TTH         2:1
TTH           HTT         3:1
TTT           HTT         7:1

As you can see, your best choice is to go for any of HTH, HTT, THH, THT, but even then the odds are 2:1 in my favour. That’s to say, I’m twice as likely to win as you in those circumstances. My odds increase to 3:1 – I’ll win three times as often as you – if you choose HHT or TTH; and my odds are a massive 7:1 – I’ll win seven times as often as you – if you choose HHH or TTT.

So, even if you play optimally, I’ll win twice as often as you. But why should this be so? The probabilities aren’t difficult to calculate, but most are a little more complicated than I can reasonably include here. Let’s take the easiest example though. Suppose you choose HHH, in which case I choose THH. I then start tossing the coins. It’s possible that HHH will be the first 3 coins in the sequence. That will happen with probability 1/2 x 1/2 x 1/2 = 1/8. But if that doesn’t happen, then there’s no way you can win. The first time HHH appears in the sequence it must have been preceded by a T (otherwise HHH would have occurred earlier), in which case my THH occurs before your HHH. So, you win with probability 1/8, I win with probability 7/8, and my odds of winning are 7:1.

Like I say, the other cases – except when you choose TTT, which is identical to HHH, modulo switching H’s and T’s – are a little more complicated, but the principle is essentially the same every time.
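If you’d rather check those odds empirically than work through the probability calculations, a few lines of Python will do it. This is only a rough sketch – the function names and the number of simulated games are mine, not anything from the original post – which plays the game repeatedly for a given pair of triplets and reports the observed odds in my favour.

```python
import random

def play(yours, mine):
    """Toss a fair coin until one of the two triplets appears; return the winner."""
    last3 = ""
    while True:
        last3 = (last3 + random.choice("HT"))[-3:]
        if last3 == yours:
            return "you"
        if last3 == mine:
            return "me"

def estimate_odds(yours, mine, n=100_000):
    """Approximate odds in my favour over n simulated games."""
    my_wins = sum(play(yours, mine) == "me" for _ in range(n))
    return my_wins / (n - my_wins)

# e.g. your HHH against my THH should come out close to 7, i.e. odds of 7:1
print(round(estimate_odds("HHH", "THH"), 2))
```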

By coincidence, this game was invented by Walter Penney (Penney -> penny, geddit?), who published it in the Journal of Recreational Mathematics in 1969. It’s interesting from a mathematical/statistical/game-theory point of view because it’s an example of a non-transitive game. For example, looking at the table above, HHT is inferior to THH; THH is inferior to TTH; TTH is inferior to HTT; and HTT is inferior to HHT. Which brings us full circle. So, there’s no overall optimal selection. Each can be beaten by another, which in turn can be beaten by a different choice again. This is why the second player has an advantage: they can always find a selection that will beat the first player’s. It doesn’t matter that their choice can also be beaten, because the first player has already made their selection and it wasn’t that one.

The best known example of a non-transitive game is Rock-Paper-Scissors. Rock beats scissors; scissors beats paper; paper beats rock. But in that case it’s completely deterministic – rock will always beat scissors, for example. In the Penney coin tossing game, HTT will usually beat TTT, but occasionally it won’t. So, it’s perhaps better described as a random non-transitive game.

The game also has connections with genetics. Strands of DNA are long chains composed of sequences of individual molecules known as nucleotides. Only four different types of nucleotide are found in DNA strands, and these are usually labelled A, T, G and C. The precise ordering of these nucleotides in the DNA strand effectively defines a code that will determine the characteristics of the individual having that DNA.

It’s not too much of a stretch of the imagination to see that a long sequence of nucleotides A, T, G and C is not so very different – albeit with 4 variants, rather than 2 – from a sequence of outcomes just like those from the tosses of a coin. Knowing which combinations of the nucleotides are more likely to occur than others, and other combinatorial aspects of their arrangements, proves to be an important contribution of Statistics to the understanding of genetics and the development of genetic intervention therapies. Admittedly, it’s not a direct application of Penney’s game, but the statistical objectives and techniques required are not so very different.


Thanks to those of you who wrote to me with solutions to this problem, all of which were at least partially correct.

An Uberlord from the world of football


Ok, this isn’t strictly Statistics, but I came across it while researching for another post and it seemed fun, so I thought I’d share it.

JiGaZo is a 300-piece jigsaw puzzle with a difference. The pieces are identically-shaped, 90-degree rotationally symmetric, sepia-coloured (to different degrees of shading) and have a colour-coded symbol on the back. You take a picture of yourself or anyone else. You upload that picture to your computer, run it through the software provided, and it spits out a grid of the codes on the back of the jigsaw pieces. You then construct the jigsaw following those coded instructions, and the result is a jigsaw reproduction of the image you uploaded.

Perhaps it’s better explained by the following ad:

You can get JiGaZo for around a tenner at Amazon. Just picture your loved one’s face on Christmas Day when they realise that not only have you constructed a 300-piece jigsaw of them as a present, but they can also disassemble it and return the favour to you in time for Boxing Day.

Christmas shopping ideas: just one of the services provided by Smartodds loves Statistics.


Now, to give this post some relevance to Statistics, we might ask how many unique images can be made with a JiGaZo set? These are the types of calculations we often have to make when enumerating probabilities in all sorts of statistical problems.

Have a quick guess at what the answer might be before scrolling down…

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

So, first we can calculate the number of ways the 300 pieces can be placed in the grid. There are 300 choices for the first piece; that leaves 299 for the second; then 298 for the third; and so on. Since each of these choices combines with each of the others, the total number of such arrangements is

300 \times 299 \times 298 \times \cdots \times 3 \times 2 \times 1

By convention this is written 300! and it’s HUGE: approximately the number 3 followed by 614 zeros.

To put that in perspective, it’s believed the number of stars in the observable universe is ‘only’ 1 followed by 21 zeros. So, if every star had its own universe, and every star in that universe had its own universe, and every star in that universe had its own universe and we kept doing that – putting a universe on every new star – a total of around 29 times, then we’d have about 300! stars in total.

But that’s not all. For every one of those 300! arrangements of the Jigazo pieces, each piece can be rotated in 4 different ways. So we have to multiply 300! by 4 three hundred times. Finally we divide that answer by 2 since any arrangement is arguably the same if it’s upside-down – I can just rotate it and get the same picture.

So, the final answer is

300! \times 4^{300} /2,

which is roughly 6 followed by 794 zeros. So, apply that universe on every star procedure another 8 times or so, and you get close to the number of unique JiGaZo images.
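If you’d like to check that arithmetic, Python’s integers can handle it exactly; the following is just a quick sketch.

```python
import math

# Exact count of JiGaZo arrangements: 300! x 4^300 / 2
count = math.factorial(300) * 4**300 // 2

print(len(str(count)))  # number of digits: should print 795
print(str(count)[0])    # leading digit: should print 6
```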

Since even the fastest computer in the world would take much longer than the age of the universe to run through all of those possibilities, you start to realise that the software that comes with JiGaZo, which aims to find a pretty good match for any input, must be a smart piece of image mapping.

But crucially… how well does it work? I guess the answer to that is determined by how easily you can recognise the following Uberlord from the world of football…

 

Woodland creatures

The Hedgehog and the Fox is an essay by the philosopher Isaiah Berlin. Though published in 1953, the title is a reference to a fragment of a poem by the ancient Greek poet Archilochus. The relevant passage translates as:

… a fox knows many things, but a hedgehog one important thing.

Isaiah Berlin used this concept to classify famous thinkers: those whose ideas could be summarised by a single principle are hedgehogs; those whose ideas are more pragmatic, multi-faceted and evolving are foxes.

This dichotomy of approaches to thinking has more recently been applied in the context of prediction, and is the basis of the following short (less than 5-minute) video, kindly suggested to me by Richard.Greene@Smartodds.co.uk.

Watch and enjoy…

So, remarkably, in a study of the accuracy of individuals when making predictions, nothing made a difference: age, sex, political outlook… except that ‘foxes’ are better predictors than ‘hedgehogs’: being well-versed in a single consistent philosophy is inferior to an adaptive and evolving approach to knowledge and its application.

The narrator, David Spiegelhalter, also summarises the strengths of a good forecaster as:

  1. Aggregation. They use multiple sources of information, are open to new knowledge and are happy to work in teams.
  2. Metacognition. They have an insight into how they think and the biases they might have, such as seeking evidence that simply confirms pre-set ideas.
  3. Humility. They have a willingness to acknowledge uncertainty, admit errors and change their minds. Rather than saying categorically what is going to happen, they are only prepared to give probabilities of future events.

(Could almost be a bible for a sports modelling company.)

These principles are taken from the book Future Babble by Dan Gardner, which looks like it’s a great read. The tagline for the book is ‘how to stop worrying and love the unpredictable’, which on its own is worth the cost of the book.


Incidentally, I could just as easily have written a blog entry with David Spiegelhalter as part of my series of famous statisticians. Until recently he was the president of the Royal Statistical Society. He was also knighted in 2014 for his services to Statistics, and has numerous awards and honorary degrees.

His contributions to statistics are many, especially in the field of Medical Statistics.  Equally though, as you can tell from the above video, he is a fantastic communicator of statistical ideas. He also has a recent book out: The art of statistics: learning from data. I’d guess that if anyone wants to learn something about Statistics from a single book, this would be the place to go. I’ve just bought it, but haven’t read it yet. Once I do, if it seems appropriate, I’ll post a review to the blog.

Favouritism

Let’s play a game. I’ve got a coin here, and I’m going to toss it repeatedly and record the sequence of outcomes: heads (H) and tails (T).

Here we go…

H T H H H T T H T H …..

That was fun.

Next, I’ll do that again, but before doing so I’ll ask you to make a prediction for a sequence of 3 tosses. Then I’ll do the same, making a different choice. So you might choose THT and then I might choose TTT. I’ll then start tossing the coin. The winner will be the person whose chosen triplet shows up first in the sequence of tosses.

So, if the coin showed up as in the sequence above, you’d have won because there’s a sequence of THT starting from the 7th toss in the sequence. If the triplet TTT had shown up before that – which it didn’t – then I’d have won.

Now, assuming we both play optimally, there are 3 possibilities for who this game might favour (in the sense of having a higher probability of winning):

  1. It favours no one. We both get to choose our preferred sequence and so, by symmetry, our chances of winning are equal.
  2. It favours you. You get to choose first and so you can make the optimal choice before I get a chance.
  3. It favours me. I get to see your choice before making mine and can make an optimal choice accordingly.

Which of these do you think is correct? Have a think about it. You might even have to decide what it means to play ‘optimally’.

If you’d like to mail me with your answers I’d be happy to hear from you. In a subsequent post I’ll discuss  the solution with reasons why this game is important.

Tennis puzzles

They’re not especially statistical, and not especially difficult, but I thought you might like to try some tennis-based puzzle questions. I’ve mentioned before that Alex Bellos has a fortnightly column in the Guardian where he presents mathematical puzzles of one sort or another. Well, to coincide with the opening of Wimbledon, today’s puzzles have a tennis-based theme. You can find them here.

I think they’re fairly straightforward, but in any case, Alex will be posting the solutions later today if you want to check your own answers.

I say they’re not especially statistical, but there is quite a lot of slightly intricate probability associated with tennis, since live tennis betting is a lucrative market these days. Deciding whether a bet is good value or not means taking the current score and an estimate of the players’ relative abilities, and converting that into a match win probability for either player, which can then be compared against the bookmakers’ odds. But how is that done? The calculations are reasonably elementary, but complicated by both the scoring system and the fact that players tend to be more likely to win a point on serve than return.
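To give a flavour of the simplest building block of such a calculation – this is only a sketch of my own, not the full method, and it assumes the server wins each point independently with the same probability p – here’s how the probability of the server winning a single game can be computed recursively in Python.

```python
def game_win_prob(p, server=0, returner=0):
    """Probability that the server wins the current game from the given score,
    assuming each point is won by the server independently with probability p."""
    if server >= 4 and server - returner >= 2:
        return 1.0
    if returner >= 4 and returner - server >= 2:
        return 0.0
    if server == 3 and returner == 3:
        # from deuce: win two points in a row before losing two in a row
        return p * p / (1 - 2 * p * (1 - p))
    return (p * game_win_prob(p, server + 1, returner)
            + (1 - p) * game_win_prob(p, server, returner + 1))

# e.g. a server winning 65% of points wins the game about 83% of the time
print(round(game_win_prob(0.65), 3))
```

Chaining the same kind of recursion through tie-breaks, sets and the match – and distinguishing serve from return – is essentially what a full in-play match probability calculation does.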

If you’re interested, the relevant calculations for all score situations are available in this academic paper, though this assumes players are equally strong on serve and return. It also assumes the outcome of each point is statistically independent from all other points – that’s to say, knowing the outcome of one point doesn’t affect the probability of who wins another point. So, to add to Alex’s 3 questions above, I might add:

Why might tennis points not be statistically independent in practice, and what is the likely effect on match probability calculations of assuming they are when they’re not?

Walking on water

Here’s a question: how do you get dogs to walk on water?

Turns out there’s a really simple answer – just heat the atmosphere up by burning fossil fuels so much that the Greenland ice sheets melt.

The remarkable picture above was taken by a member of the Centre for Ocean and Ice at the Danish Meteorological Institute. Their pre-summer retrieval of research equipment is normally a sledge ride across a frozen winter wasteland; this year it was a paddle through the ocean that’s sitting on what’s left of the ice. And the husky dogs that pull the sledge are literally walking on water.

This graph shows the extent – please note: clever play on words – of the problem…

The blue curve shows the median percentage of Greenland ice melt over the last few decades. There’s natural year-to-year variation around that average, and as with any statistical analysis, it’s important to understand what types of variation are normal before deciding whether any particular observation is unusual or not. So, in this case, the dark grey area shows the range of values that were observed in 50% of years; the light grey area is what was observed in 90% of years. So, you’d only expect observations outside the light grey area once every ten years. Moreover, the further an observation falls outside of the grey area, the more anomalous it is.

Now, look at the trace for 2019 shown in red. The value for June isn’t just outside the normal range of variation, it’s way outside. And it’s not only an unusually extreme observation for June; it would be extreme even for the hottest part of the year in July. At its worst (so far), the melt for June 2019 reached over 40%, whereas the average in mid-July is around 18%, with a value of about 35% being exceeded only once in every 10 years.

So, note how much information can be extracted from a single well-designed graph. We can see:

  1. The variation across the calendar of the average ice melt;
  2. The typical variation around the average – again across the calendar – in terms of an interval expected to contain the observed value in 50% of years: the so-called inter-quartile range;
  3. A more extreme measure of variation, showing the levels that are exceeded only once every 10 years: the so-called inter-decile range;
  4. The trace of an individual year – up to current date – which appears anomalous.

In particular, by showing us the variation in ice melt both within years and across years we were able to conclude that this year’s June value is truly anomalous.
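To make that concrete, here’s a minimal sketch – with made-up numbers rather than the real Greenland data – of how bands like those are constructed: they’re just percentiles of the historical values for each day of the year.

```python
import numpy as np

# rows = years of historical data, columns = days of the year (fake numbers)
rng = np.random.default_rng(1)
melt = rng.gamma(shape=2.0, scale=5.0, size=(40, 365))   # made-up melt percentages

median = np.percentile(melt, 50, axis=0)          # the central curve
q25, q75 = np.percentile(melt, [25, 75], axis=0)  # band containing ~50% of years
p05, p95 = np.percentile(melt, [5, 95], axis=0)   # wider band containing ~90% of years
```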

Now let’s look at another graph. These are average spring temperatures, not for Greenland but for Alaska, where there are similar concerns about ice melt caused by increased atmospheric temperatures.


Again, there’s a lot of information:

  1. Each dot is an average spring temperature, one per year;
  2. The dots have been coloured: most are black, but the blue and red ones correspond to the ten coldest and hottest years respectively;
  3. The green curve shows the overall trend;
  4. The value for 2019 has been individually identified.

And the picture is clear. Not only has the overall trend been increasing since around the mid-seventies, but almost all of the hottest years have occurred in that period, while almost none of the coldest have. In other words, the average spring temperature in Alaska has been increasing over the last 50 years or so, and is hotter now than it has been for at least 90 years (and probably much longer).

Now, you don’t need to be a genius in biophysics to understand the cause and effect relating temperature and ice. So the fact that extreme ice melts are occurring in the same period as extreme temperatures is hardly surprising. What’s maybe less well-known is that the impact of these changes has a knock-on effect way beyond the confines of the Arctic.

So, even if dogs walking on the water of the arctic oceans seems like a remote problem, it’s part of a chain of catastrophic effects that will soon affect our lives too. Statistics has an important role to play in determining and communicating the presence and cause of these effects, and the better we all are at understanding those statistics, the more likely we will be able to limit the damage that is already inevitable. Fortunately, our governments are well aware of this and are taking immediate actions to remedy the problem.

Oh, wait…

… scrap that, better take action ourselves.

First pick

Zion Williamson

If you follow basketball you’re likely to know that the NBA draft was held this weekend, resulting in wonderkid Zion Williamson being selected by the New Orleans Pelicans. The draft system is a procedure by which newly available players are distributed among the various NBA teams.

Unlike most team sports at professional level in Europe, the draft system is a partial attempt to balance out teams in terms of the quality of their players. Specifically, teams that do worse one season are given preference when choosing players for the next season. It’s a slightly archaic and complicated procedure – which is shorthand for saying I couldn’t understand all the details from Wikipedia – but the principles are simple enough.

There are 3 stages to the procedure:

  1. A draft lottery schedule, in which teams are given a probability of having first pick, second pick and so on, based on their league position in the previous season. Only teams below a certain level in the league are permitted to have the first pick,  and the probabilities allocated to each team are inversely related to their league position. In particular, the lowest placed teams have the highest probability of getting first pick.
  2. The draft lottery itself, held towards the end of May, where the order of pick selections is assigned randomly to the teams according to the probabilities assigned in the schedule.
  3. The draft selection, held in June, where teams make their picks in the order that they’ve been allocated in the lottery procedure.

In the 2019 draft lottery, the first pick probabilities were assigned as follows:


So, the lowest-placed teams, New York, Cleveland and Phoenix, were all given a 14% chance, down to Charlotte, Miami and Sacramento who were given a 1% chance. The stars and other indicators in the table are an additional complication arising from the fact that teams can trade their place in the draw from one season to another.

In the event, following the lottery based on these probabilities, the first three picks were given to New Orleans, Memphis and New York respectively. The final stage in the process was then carried out this weekend, resulting in the anticipated selection of Zion Williamson by the New Orleans Pelicans.

There are several interesting aspects to this whole process from a statistical point of view.

The first concerns the physical aspects of the draft lottery. Here’s an extract from the NBA’s own description of the procedure:

Fourteen ping-pong balls numbered 1 through 14 will be placed in a lottery machine. There are 1,001 possible combinations when four balls are drawn out of 14, without regard to their order of selection. Before the lottery, 1,000 of those 1,001 combinations will be assigned to the 14 participating lottery teams. The lottery machine is manufactured by the Smart Play Company, a leading manufacturer of state lottery machines throughout the United States. Smart Play also weighs, measures and certifies the ping-pong balls before the drawing.

The drawing process occurs in the following manner: All 14 balls are placed in the lottery machine and they are mixed for 20 seconds, and then the first ball is removed. The remaining balls are mixed in the lottery machine for another 10 seconds, and then the second ball is drawn. There is a 10-second mix, and then the third ball is drawn. There is a 10-second mix, and then the fourth ball is drawn. The team that has been assigned that combination will receive the No. 1 pick. The same process is repeated with the same ping-pong balls and lottery machine for the second through fourth picks.

If the same team comes up more than once, the result is discarded and another four-ball combination is selected. Also, if the one unassigned combination is drawn, the result is discarded and the balls are drawn again. The length of time the balls are mixed is monitored by a timekeeper who faces away from the machine and signals the machine operator after the appropriate amount of time has elapsed.

You probably don’t need me to explain how complicated this all is, compared to the two lines of code it would take to carry out the same procedure electronically. Arguably, perhaps, seeing the lottery carried out with the physical presence of ping pong balls might stop people thinking the results had been fixed. Except it doesn’t. So, it’s all just for show. Why do things efficiently and electronically when you can add razzmatazz and generate high TV ratings? Watching a statistician produce the same draw in a couple of minutes on a laptop maybe just wouldn’t have the same appeal.
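For what it’s worth, here’s roughly what the electronic version might look like in Python – slightly more than two lines once you add the bookkeeping. The team names and weights below are only the handful of probabilities quoted in this post, not the full official table.

```python
import random

# The physical procedure, done electronically: draw 4 of the 14 numbered balls,
# ignoring order, then look up which team owns that combination.
combination = frozenset(random.sample(range(1, 15), 4))

# Or, equivalently, a single weighted draw. These weights are only the
# probabilities quoted in the post, not the complete 2019 table.
teams = ["New York", "Cleveland", "Phoenix", "Chicago", "Charlotte"]
weights = [0.14, 0.14, 0.14, 0.125, 0.01]
first_pick = random.choices(teams, weights=weights, k=1)[0]
print(first_pick)
```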

Anyway, my real reason for including this topic in the blog is the following. In several previous posts I’ve mentioned the use of simulation as a statistical technique. Applications are varied, but in most cases simulation is used to generate many realisations from a probability model in order to get a picture of what real data are likely to look like if their random characteristics are somehow linked to that probability model. 

For example, in this post I simulated how many packs of Panini stickers would be needed to fill an album. Calculating the probabilities of the number of packs needed to complete an album is difficult, but the simulation of the process of completing an album is easy.
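To show just how easy that sort of simulation is, here’s a minimal sketch – not the code from that post, and with illustrative album and pack sizes rather than the real Panini numbers – of the process of filling an album.

```python
import random

def packs_to_fill_album(n_stickers=682, pack_size=5):
    """Simulate buying packs of random stickers until an album of n_stickers is full.
    (Illustrative parameters; ignores swaps and assumes all stickers are equally likely.)"""
    collected = set()
    packs = 0
    while len(collected) < n_stickers:
        collected.update(random.randrange(n_stickers) for _ in range(pack_size))
        packs += 1
    return packs

# average number of packs over a few hundred simulated albums
sims = [packs_to_fill_album() for _ in range(200)]
print(sum(sims) / len(sims))
```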

And in a couple of recent posts (here and here) we used simulation techniques to verify what seemed like an easy intuitive result. As it turned out, the simulated results were different from what the theory suggested, and a slightly deeper study of the problem showed that some care was needed in the way the data were simulated. But nonetheless, the principle of using simulations to investigate the expected outcomes of a random experiment was sound. In each case simulations were used to generate data from a process whose probabilities would have been practically impossible to calculate by other means.

Which brings me to this article, sent to me by Oliver.Cobb@smartodds.co.uk. On the day of the draft lottery, the masterminds at USA Today decided to run 100 simulations of the draft lottery to see which team would get the first pick. It’s mind-numbingly pointless. As Ollie brilliantly put it:

You have to admire the way they’ve based an article on taking a known chance of something happening and using just 100 simulations to generate a less reliable figure than the one they started with.

In case you’re interested, and can’t be bothered with the article, Chicago got selected for first pick most often – 19 times – in the 100 USA Today simulations, and were therefore ‘predicted’ to win the lottery. But if they’d run their simulations many more times, it’s virtually guaranteed that Chicago wouldn’t have come out on top, but would instead have been allocated first pick on close to the 12.5% of occasions corresponding to their probability in the table above. With enough simulations, the simulated contest would almost certainly be won by one of New York, Cleveland or Phoenix, whose proportions would be separated only by small amounts due to random variation.

The only positive thing you can say about the USA Today article, is that at least they had the good sense not to do the simulation with 14 actual ping pong balls. As they say themselves:

So to celebrate one of the most cruel and unusual days in sports, we ran tankathon.com’s NBA draft lottery simulator 100 times to predict how tonight will play out. There’s no science behind this. We literally hit “sim lottery” 100 times and wrote down the results.

I especially like the “there’s no science behind this” bit.  Meantime, if you want to create your own approximation to a known set of probabilities, you too can hit the “sim lottery” button 100 times here.


Update: Benoit.Jottreau@Smartodds.co.uk pointed me at this article, which is relevant for two reasons. First, in terms of content. In previous versions of the lottery system, there was a stronger incentive in terms of probability assignments for teams to do badly in the league. This led to teams ‘tanking’: deliberately throwing games towards the end of a season when they knew they were unlikely to reach the playoffs, thereby improving their chances of getting a better player in the draft for the following season. The 2019 version of the lottery aims to reduce this effect, by giving teams less of an incentive to be particularly poor. For example, the lowest three teams in the league now share the highest probability of first pick in the draft, whereas previously the lowest team had a higher probability than all others. But the article Benoit sent me suggests that the changes are unlikely to have much of an impact. It concludes:

…it seems that teams that want to tank still have strong incentives to tank, even if the restructured NBA draft lottery makes it less likely for them to receive the best picks.

The other reason why this article is relevant is that it makes much more intelligent use of simulation as a technique than the USA Today article referred to above.

Revel in the amazement

In an earlier post I included the following table:

As I explained, one of the columns contains the genuine land areas of each country, while the other is fake. And I asked you which is which.

The answer is that the first column is genuine and the second is fake. But without a good knowledge of geography, how could you possibly come to that conclusion?

Well, here’s a remarkable thing. Suppose we take just the leading digit of each  of the values. Column 1 would give 6, 2, 2, 1,… for the first few countries, while column 2 would give 7, 9, 3, 3,… It turns out that for many naturally occurring phenomena, you’d expect the leading digit to be 1 on around 30% of occasions. So if the actual proportion is a long way from that value, then it’s likely that the data have been manufactured or manipulated.

Looking at column 1 in the table, 5 out of the 20 countries have a land area with leading digit 1; that’s 25%. In column 2, none do; that’s 0%. Even 25% is a little on the low side, but close enough to be consistent with 30% once you allow for discrepancies due to random variation in small samples. But 0% is pretty implausible. Consequently, column 1 is consistent with the 30% rule, while column 2 is not, and we’d conclude – correctly – that column 2 is faking it.

But where does this 30% rule come from? You might have reasoned that each of the digits 1 to 9 were equally likely – assuming we drop leading zeros – and so the percentage would be around 11% for a leading digit of 1, just as it would be for any of the other digits. Yet that reasoning turns out to be misplaced, and the true value is around 30%.

This phenomenon is a special case of something called Benford’s law, named after the physicist Frank Benford who first formalised it. (Though it had also been noted much earlier by the astronomer Simon Newcomb). Benford’s law states that for many naturally occurring datasets, the probability that the leading digit of a data item is 1 is equal to 30.1%. Actually, Benford’s law goes further than that, and gives the percentage of times you’d get a 2 or a 3 or any of the digits 1-9 as the leading digit. These percentages are shown in the following table.

Leading digit   1       2       3       4      5      6      7      8      9
Frequency       30.1%   17.6%   12.5%   9.7%   7.9%   6.7%   5.8%   5.1%   4.6%

For those of you who care about such things, these percentages are log(2/1), log(3/2), log(4/3) and so on up to log(10/9), where log here is logarithm with respect to base 10.
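The same formula reproduces the whole table in a couple of lines of Python, if you want to check it:

```python
import math

# Benford frequencies: log10((d + 1) / d) for leading digits d = 1, ..., 9
for d in range(1, 10):
    print(d, f"{100 * math.log10((d + 1) / d):.1f}%")
```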

But does Benford’s law hold up in practice? Well, not always, as I’ll discuss below. But often it does. For example, I took a dataset giving the altitudes of a large set of football stadiums around the world. I discarded a few whose altitude is below sea level, but was still left with over 13,000 records. I then extracted the leading digit of each of the altitudes (in metres)  and plotted a histogram of these values. This is just a plot of the percentages of occasions each value occurred. These are the blue bars in the following diagram. I then superimposed the predicted proportions from Benford’s law. These are the black dots.

 

The agreement between the observed percentages and those predicted by Benford’s law is remarkable. In particular, the observed percentage of leading digits equal to 1 is almost exactly what Benford’s law would imply. I promise I haven’t cheated with the numbers.

As further examples, there are many series of mathematically generated numbers for which Benford’s law holds exactly.

These include:

  • The Fibonacci series: 1, 1, 2, 3, 5, 8, 13, …. where each number is obtained by summing the 2 previous numbers in the series.
  • The integer powers of two: 1, 2, 4, 8, 16, 32, …..
  • The iterative series obtained by starting with any number and successively multiplying by 3. For example, starting with 7, we get: 7, 21, 63, 189,….

In each of these cases of infinite series of numbers, exactly 30.1% will have leading digit equal to 1; exactly 17.6% will have leading digit equal to 2, and so on.
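You can check this empirically with a few lines of Python. Here’s a quick sketch for the powers of two – the cut-off of 10,000 powers is an arbitrary choice of mine.

```python
from collections import Counter

# Leading digits of the first 10,000 powers of two
n = 10_000
leading = Counter(int(str(2**k)[0]) for k in range(1, n + 1))
for d in range(1, 10):
    print(d, f"{100 * leading[d] / n:.1f}%")
```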

And there are many other published examples of data fitting Benford’s law (here, here, here… and so on.)

Ok, at this point you should pause to revel in the amazement of this stuff. Sometimes mathematics, Statistics and probability come together in a way to explain naturally occurring phenomena that is so surprising and shockingly elegant it takes your breath away.

So, when does Benford’s law work? And why?

It turns out there are various ways of explaining Benford’s law, but none of them – at least as far as I can tell – is entirely satisfactory. All of them require a leap of faith somewhere to match the theory to real-life. This view is similarly expressed in an academic article, which concludes:

… there is currently no unified approach that simultaneously explains (Benford’s law’s) appearance in dynamical systems, number theory, statistics, and real-world data.

Despite this, the various arguments used to explain Benford’s law do give some insight into why it might arise naturally in different contexts:

  1. If there is a law of this type, Benford’s law is the only one that works for all choices of scale. The decimal representation of numbers is entirely arbitrary, presumably deriving from the fact that humans, generally, have 10 fingers. But if we’d been born with 8 fingers, or chosen to represent numbers anyway in binary, or base 17, or something else, you’d expect a universal law to be equally valid, and not dependent on the arbitrary choice of counting system. If this is so, then it turns out that Benford’s law, adapted in the obvious way to the choice of scale, is the only one that could possibly hold. An informal argument as to why this should be so can be found here.
  2. If the logarithm of the variable under study has a distribution that is smooth and roughly symmetric – like the bell-shaped normal curve, for example – and is also reasonably well spread out, it’s easy to show that Benford’s law should hold approximately. Technically, for those of you who are interested, if X is the thing we’re measuring, and if log X has something like a normal distribution with a variance that’s not too small, then Benford’s law is a good approximation for the behaviour of X. A fairly readable development of the argument is given here. (Incidentally, I stole the land area of countries example directly from this reference.) There’s a small simulation sketch of this idea just after this list.
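Here’s that sketch – only an illustration, with an arbitrary mean and standard deviation for log X of my own choosing – which simulates data whose logarithm is normally distributed and tabulates the leading digits against Benford’s percentages.

```python
import math
import random
from collections import Counter

# If log10(X) is roughly normal with a reasonably large spread,
# the leading digit of X should be close to Benford's law.
n = 100_000
samples = (10 ** random.gauss(3, 2) for _ in range(n))      # arbitrary mu=3, sigma=2
leading = Counter(int(f"{x:e}"[0]) for x in samples)        # first digit of each value

for d in range(1, 10):
    benford = math.log10((d + 1) / d)
    print(d, f"observed {100 * leading[d] / n:.1f}%", f"Benford {100 * benford:.1f}%")
```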

But in the first case, there’s no explanation as to why there should be a universal law, and indeed many phenomena – both theoretical and in nature – don’t follow Benford’s law. And in the second case, except for special situations where the normal distribution has some kind of theoretical justification as an approximation, there’s no particular reason why the logarithm of the observations should behave in the required way. And yet, in very many cases – like the land area of countries or the altitude of football stadiums – the law can be shown empirically to be a very good approximation to the truth.

One thing which does emerge from these theoretical explanations is a better understanding of when Benford’s law is likely to apply and when it’s not. In particular, the argument only works when the logarithm of the variable under study is reasonably well spread out. What that means in practice is that the variable itself needs to cover several orders of magnitude: tens, hundreds, thousands etc. This works fine for something like the stadium altitudes, which vary from close to sea-level up to around 4,000 metres, but wouldn’t work for total goals in football matches, which are almost always in the range 0 to 10, for example.

So, there are different ways of theoretically justifying Benford’s law, and empirically it seems to be very accurate for different datasets which cover orders of magnitude. But does it have any practical uses? Well, yes: applications of Benford’s law have been made in many different fields, including…

Finally, there’s also a version of Benford’s law for the second digit, third digit and so on. There’s an explanation of this extension in the Wikipedia link that I gave above. It’s probably not easy to guess exactly what the law might be in these cases, but you might try and guess how the broad pattern of the law changes as you move from the first to the second and to further digits.


Thanks to those of you who wrote to me after I made the original post. I don’t think it was easy to guess what the solution was, and indeed if I were guessing myself, I think I’d have been looking for uniformity in the distribution of the digits, which turns out to be completely incorrect, at least for the leading digit. Even though I’ve now researched the answer myself, and made some sense of it, I still find it rather shocking that the law works so well for an arbitrary dataset like the stadium altitudes. Like I say: revel in the amazement.

Statty night

Apologies for the terrible pun in the title.

When I used to teach Statistics I tried to emphasise to students that Statistics is as much an art as a science. Statisticians are generally trying to make sense of some aspect of the world, and they usually have just some noisy data with which to try to do it. Sure, there are algorithms and computer packages they can chuck data into and get simple answers out of. But usually those answers are meaningless unless the algorithm/package is properly tailored to the needs of the specific problem. And there are no rules as to how that is best done: it needs a good understanding of the problem itself, an awareness of the data that are available and the creative skill to be able to mesh those things with appropriate statistical tools. And these are skills that are closer to the mindset of an artist than of a scientist.

But anyway… I recently came across the following picture which turns the tables, and uses Statistics to make art. (Or to destroy art, depending on your point of view). You probably recognise the picture at the head of this post as Van Gogh’s Starry Night, which is displayed at MOMA in New York.

By contrast, the picture below is a statistical reinterpretation of the original version of Starry Night, created by photographer Mario Klingemann through a combination of data visualisation and statistical summarisation techniques.

The Starry Night Pie Packed

As you can see, the original painting has been replaced by a collage of coloured circles, which are roughly the same colour as the original painting. But in closer detail, the circles have an interesting structure. Each is actually a pie chart whose slices in size and colour correspond to the proportions of colours in that region of the original picture.

Yes, pointless, but kind of fun nonetheless. You can find more examples of Klingemann’s statistically distorted classical artworks here.
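If you’re curious how the colour proportions for one of those circles might be computed, here’s a rough sketch using the Pillow imaging library – the filename, tile position and number of colours are all just placeholders of mine, not anything to do with Klingemann’s actual process.

```python
from collections import Counter
from PIL import Image  # Pillow

def tile_colour_proportions(image_path, box, n_colours=5):
    """Reduce one rectangular tile of the image to its n most common colours and
    return the proportion of pixels taken by each -- the 'slices' of one pie."""
    tile = Image.open(image_path).convert("RGB").crop(box)
    reduced = tile.quantize(colors=n_colours).convert("RGB")
    counts = Counter(reduced.getdata())
    total = sum(counts.values())
    return {colour: count / total for colour, count in counts.most_common()}

# e.g. the 50x50 pixel tile in the top-left corner of a local copy of the painting
print(tile_colour_proportions("starry_night.jpg", box=(0, 0, 50, 50)))
```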

In similar vein… the diagram below, produced by artist Arthur Buxton, is actually a quiz. Each of the pie charts represents the proportions of the main colours in one of Van Gogh’s paintings. In other words, these pie charts represent the colour distributions over a whole Van Gogh painting, rather than just a small region of a picture, as in the painting above. The quiz is to identify which Van Gogh painting each of the pie charts refers to.

You can find a short description of Arthur Buxton’s process in developing this picture here.

There’s just a small snag: I haven’t been able to locate the answers. My guess is that the pie chart in column 2 of row 2 corresponds to Starry Night. And the one immediately to the left of that is from the Sunflower series. But that’s pretty much exhausted my knowledge of the works of Van Gogh. Let me know if you can identify any of the others and I’ll add them to a list below.


On the basis of experience with jigsaw puzzles – hey, we’re all on a learning curve and you never know when acquired knowledge will be useful – Nity.Raj@Smartodds.co.uk reliably informs me that the third pie chart from the left on the bottom row will correspond to one of the paintings from Van Gogh’s series of Irises. Looking at this link which Nity gave me it seems entirely plausible.