The China syndrome

In a couple of earlier posts I’ve mentioned how statistical analyses have sometimes been used to demonstrate that results in published analyses are ‘too good to be true’. One of these cases concerned Mendel’s laws of genetic inheritance. Though the laws have subsequently been shown to be unquestionably true, Mendel’s results on pea experiments were insufficiently random to be credible. The evidence strongly suggests that Mendel tweaked his results to fit the laws he believed to be true. He just didn’t understand enough about statistics to realise that the very laws he wanted to establish also implied sizeable random variation around predicted results, and the values he reported were much too close to the predicted values to be plausible.

As discussed in a recent academic article, a similar issue has been discovered in respect of official Chinese figures for organ donation. China has recently come under increasing international pressure to discontinue its practice of using organs of dead prisoners for transplants. One issue was consent – did prisoners consent to the use of their organs before their death? But a more serious issue was with respect to possible corruption and even the possibility that  some prisoners were executed specifically to make their organs available.

Anyway, since 2010 China has made efforts to discontinue this practice, replacing it with a national system of voluntary organ donation. Moreover, they announced that from 2015 onwards only hospital-based voluntary organ donations would be used for transplants.  And as evidence of the success of this program, two widely available datasets published respectively by the China Organ Transplant Response System (COTRS)  and the Red Cross Society of China, show rapid growth in the numbers of voluntary organ donations, which would more than compensate for the cessation of the practice of donations from prisoners.

Some of the yearly data counts from the COTRS database are shown in this figure taken from the report references above. The actual data are shown by points (or triangles and crosses); the curves have been artificially added to show the general trend in the observed data. Clearly, for each of the count types, one can observe a rapid growth rate in the number of donations.

But… here’s the thing… look at how closely the smooth curves approximate the data values. The fit is almost perfect for each of the curves. And there’s a similar phenomenon for other data, including the Red Cross data. But when similar relationships are looked at for data from other countries, something different happens: the trend is generally upwards, as in this figure, but the data are much more variable around the trend curve.

In summary, it seems much more likely that the curves have been chosen, and the data chosen subsequently to fit very closely to the curves. But just like Mendel’s pea data, this has been done without a proper awareness that nature is bound to lead to substantial variations around an underlying law. However, unlike Mendel, who presumably just invented numbers to take shortcuts to establish a law that was true, the suspicion remains that neither the data nor the law are valid in the case of the Chinese organ donation numbers.

A small technical point for those of you that might be interested in such things. The quadratic curves in the above plot were fitted in the report by the method of simple least squares, which aims to find the quadratic curve which minimises the overall distance between the points and the curve. As a point of principle, I’d argue this is not very sensible. When the counts are bigger, one would expect to get more variation, so we’d probably want to downweight the value of the variation for large counts, and increase it for the lower counts. In other words, we’d expect the curve to fit better in the early years and worse in the later years, and we should take that into account when fitting the curve. In practice, the variations around the curves are so small, the results obtained by doing things this way are likely to be almost identical. So, it’s just a point of principle more than anything else. But still, in an academic paper which purports to use the best available statistics to discredit the claim made by a national government, it would probably be best to make sure you really are using the most appropriate statistical methods for the analysis.

Juvenile dinosaurs

This blog is mostly about Statistics as a science rather than statistics as numbers. But just occasionally the statistics themselves are so shocking, they’re worthy of a mention.

With this in mind I was struck by two statistics of a similar theme in the following tweet from Ben Goldacre (author of the Bad Science website and book):

Moreover, in the discussion following Ben’s tweet, someone linked to the following cartoon figure:

This shows that even if you change the way of measuring distance from time to either phylogenetic distance or physical similarity, the following holds: the distance between a sparrow and T-Rex is smaller than that between T-Rex and Stegosaurus.

Footnote 1: this is more than a joke. Recent research makes the case that there is a strong evolutionary link between birds and dinosaurs. As one of the authors writes:

We now understand the relationship between birds and dinosaurs that much better, and we can say that, when we look at birds, we are actually looking at juvenile dinosaurs.

Footnote 2. Continuing the series (also taken from the discussion of Ben’s tweet)… Cleopatra is closer in time to the construction of the space shuttle than the pyramids.

Footnote 3. Ben Goldacre’s book, Bad Science, is a great read. It includes many examples of the way science – and Statistics – can be misused.

Relatively speaking

Last week, when discussing Kipchoge’s recent sub 2-hour marathon run, I showed the following figure which compares histograms of marathon race times in a large database of male and female runners.

I mentioned then that I’d update the post to discuss the other unusual shape of the histograms. The point I intended to make concerns the irregularity of the graphs. In particular, there are many spikes, especially before the 3, 3.5 and 4 hour marks. Moreover, there is a very large drop in the histograms – most noticeably for men – after the 4 hour mark.

This type of behaviour is unusual in random processes:. frequency diagrams of this type, especially those  based on human characteristics, are generally much smoother. Naturally, with any sample data, some degree of irregularity in frequency data is inevitable, but:

1. These graphs are based on a very large sample of more than 3 million runners, so random variations are likely to be very small;
2. Though irregular in shape, the timings of the irregularities are themselves regular.

So, what’s going on?

The irregularities are actually a consequence of the psychology of marathon runners attempting to achieve personal targets. For example, many ‘average’ runners will set a race time target of 4 hours. Then, either through a programmed training regime or sheer force of will on the day of the race, will push themselves to achieve this race time. Most likely not by much, but enough to be on the left side of the 4-hour mark.

The net effect of many runners behaving similarly is to cause a surge of race times just before the 4-hour mark and a dip thereafter. There’s a similar effect at 3 and 3.5 hours – albeit of a slightly smaller magnitude – and smaller effects still at what seem to be around 10 minute intervals. So, the spikes in the histograms are due to runners consciously adapting their running pace to meet self-set objectives which are typically at regular times like 3, 3.5, 4 hours and so on.

Thanks to those of you that wrote to me to explain this effect.

Actually though, since writing the original post, something else occurred to me about this figure, which is why I decided to write this separate post instead of just updating the original one. Take a look at the right hand side of the plot – perhaps from a finish time of around 5 hours onwards. The values of the histograms are pretty much the same for men and women in this region. This contrasts sharply with the left side of the diagram where there are many more men than women finishing the race in, say, less than 3 hours. So, does this mean that although at faster race times there are many more men than women, at slow race times there are just as many women as men?

Well, yes and no. In absolute terms, yes: there are pretty much the same number of men as women completing the race with a time of around 6 hours. But… this ignores the fact that there are actually many more men than women overall – one of the other graphics on the page from which I copied the histograms states that the male:female split in the database is 61.8% to 31.2%. So, although the absolute numbers of men race times is similar to that of women, the proportion of runners that represents is considerably lower compared to women.

Arguably, comparing histograms gives a misleading representation of the data. It makes it look as though men and women are equally likely to have a race time of around 6 hours. Though true, this is only because many more men than women run the marathon.  The proportion of men completing the race with a time of around 6 hours is considerably smaller than that of women.

The same principle holds at all race times but is less of an issue when interpreting the graph. For example, the difference in proportions of men and women having a race time of around 4 hours is smaller than that of the actual frequencies in the histograms above, but it is still a big difference. It’s really where the absolute frequencies are similar that the picture above can be misleading.

In summary: there is a choice when drawing histograms of using absolute or relative frequencies. (Or counts and percentages). When looking at a single histogram it makes little difference – the shape of the histogram will be identical in both cases. When comparing two or more sets of results, histograms based on relative frequencies are generally easier to interpret. But in any case, when interpreting any statistical diagram, always look at the fine detail provided in the descriptions on the axes so as to be sure what you’re looking at.

Footnote:

Some general discussion and advice on drawing histograms can be found here.

No smoke without fire

No one seriously now doubts that cigarette smoking increases your risk of lung cancer and many other diseases, but when the evidence for a relationship between smoking and cancer was first presented in the 1950’s, it was strongly challenged by the tobacco industry.

The history of the scientific fight to demonstrate the harmful effects of smoking is summarised in this article. One difficulty from a statistical point of view was that the primary evidence based on retrospective studies was shaky, because smokers tend to give unreliable reports on how much they smoke. Smokers with illnesses tend to overstate how much they smoke; those who are healthy tend to understate their cigarette consumption. And these two effects lead to misleading analyses of historically collected data.

An additional problem was the difficulty of establishing causal relationships from statistical associations. Similar to the examples in a previous post, just because there’s a correlation between smoking and cancer, it doesn’t necessarily mean that smoking is a risk factor for cancer. Indeed, one of the most prominent statisticians of the time – actually of any time – Sir Ronald Fisher, wrote various scientific articles explaining how the correlations observed between smoking and cancer rates could easily be explained by the presents of lurking variables that induce spurious correlations.

At which point it’s worth noting a couple more ‘coincidences’: Fisher was a heavy smoker himself and also an advisor to the Tobacco Manufacturers Standing Committee. In other words, he wasn’t exactly neutral on the matter. But, he was a highly respected scientist, and therefore his scepticism carried considerable weight.

Eventually though, the sheer weight of evidence – including that from long-term prospective studies – was simply too overwhelming to be ignored, and governments fell into line with the scientific community in accepting that smoking is a high risk factor for various types of cancer.

An important milestone in that process was the work of another British statistician, Austin Bradford Hill. As well as being involved in several of the most prominent cases studies linking cancer to smoking, he also developed a set of 9 (later extended to 10) criteria for establishing a causal relationship between processes. Though still only guidelines, they provided a framework that is still used today for determining whether associated processes include any causal relationships. And by these criteria, smoking was clearly shown to be a risk factor for smoking.

Now, fast-forward to today and there’s a similar debate about global warming:

1. Is the planet genuinely heating up or is it just random variation in temperatures?
2. If it’s heating up, is it a consequence of human activity, or just part of the natural evolution of the planet?
3. And then what are the consequences for the various bio- and eco-systems living on it?

There are correlations all over the place – for example between CO2 emissions and average global temperatures as described in an earlier post – but could these possibly just be spurious and not indicative of any causal relationships?  Certainly there are industries with vested interests who would like to shroud the arguments in doubt. Well, this nice article applies each of Bradford Hill’s criteria to various aspects of climate science data and establishes that the increases in global temperatures are undoubtedly caused by human activity leading to CO2 release in the atmosphere, and that many observable changes to biological and geographical systems are a knock-on effect of this relationship.

In summary: in the case of the planet, the smoke that we see <global warming> is definitely a consequence of the fire we stared <the increased amounts of CO2 released into the atmosphere>.

Weapons of math destruction

I haven’t read it, but Cathy O’Neil’s ‘Weapons of Math Destruction‘  is a great title for a book. Here’s what one reviewer wrote:

Cathy O’Neil an experienced data scientist and mathematics professor illustrates the pitfalls of allowing data scientists to operate in a moral and ethical vacuum including how the poor and disadvantaged are targeted for payday loans, high cost insurance and political messaging on the basis of their zipcodes and other harvested data.

So, WOMD shows how the data-based algorithms that increasingly form the fabric of our lives – from Google to Facebook to banks to shopping to politics – and the statistical methodology behind them are actually pushing societies in the direction of greater inequality and reduced democracy.

At the time of writing WOMD these arguments were still in their infancy; but now we are starting to live the repercussions of the success of the campaign to remove Britain from the EU – which was largely driven by a highly professional exercise in Data Science – they seem much more relevant and urgent.

Anyway, Cathy O’Neil herself recently gave an interview to Bloomberg. Unfortunately, you now have to subscribe to read the whole article, so you won’t see much if you follow the link. But it was an interesting interview for various reasons. In particular, she discussed the trigger which led her to a love of data and mathematics. She wrote that when she was nine her father showed her a mathematics puzzle. And solving that problem led Cathy to a lifelong appreciation of the power of mathematical thinking. She wrote..

It’s more of a mathematical than a statistical puzzle, but maybe you’d like to think about it for yourself anyway…

Consider this diagram:

It’s a chessboard with 2 of the corner squares removed. Now, suppose you had a set of 31 dominoes, with each domino being able to cover 2 adjacent horizontal or vertical squares. Your aim is to find a way of covering the 62 squares of the mutilated board with the 31 dominoes. If you’d like to try it, mail me with either a diagram or photo of your solution; or, if you think it can’t be done, mail me an explanation. I’ll discuss the solution in a future post.

Terrible maps

One of the themes in this blog has been the creative  use of diagrams to represent statistical data. When the data are collected geographically this amounts to using maps to represent data – perhaps using colours or shadings to show how a variable changes over a region, country or even the whole world.

With this in mind I recommend to you @TerribleMaps on twitter.

It’s usually entertaining, and sometimes – though not always – scientific. Here are a few recent examples:

1. Those of you with kids are probably lamenting right now the length of the summer holidays. But just look how much worse it could be if, for example, you were living in Italy (!):
2. Just for fun… a map of the United States showing the most commonly used word in each state:
3. A longitudinal slicing of the world by population size. It’s interesting because the population per size will depend both on the number of countries that are included as well as the population density in those slices.
4. For each country in the following map, the flag shown is that of the country with which it shares the longest border. For example, the UK has its longest border with Ireland, and so is represented by the Ireland flag. Similarly, France’s flag is that of Brazil!
5. This one probably only makes sense if you were born in, or have spent time living in, Italy
6. While this one will help you get clued-up on many important aspects of UK culture:
7. And finally, this one will help you understand how ‘per capita’ calculations are made. You might notice there’s one country with an N/A entry. Try to identify which country that is and explain why its value is  missing.

In summary, as you’ll see from these examples, the maps are usually fun, sometimes genuinely terrible, but sometimes contain a genuine pearl of statistical or geographical wisdom. If you have to follow someone on twitter, there are worse choices you could make.

Zipf it

In a recent post I explained that in a large database of containing the words from many English language texts of various types, the word ‘football’ occurred 25,271 times, making it the 1543rd most common word in the database. I also said that the word ‘baseball’ occurred 28,851 times, and asked you to guess what its rank would be.

With just this information available, it’s impossible to say with certainty what the exact rank will be. We know that ‘baseball’ is more frequent than ‘football’ and so it must have a higher rank (which means a rank with a lower number). But that simply means it could be anywhere from 1 to 1542.

However, we’d probably guess that ‘baseball’ is not so much more popular a word than ‘football’; certainly other words like ‘you’, ‘me’, ‘please’ and so on are likely to occur much more frequently. So, we might reasonably guess that the rank of ‘baseball’ is closer to the lower limit of 1542 than it is to the upper limit of 1. But where exactly should we place it?

Zipf’s law provides a possible answer.

In its simplest form Zipf’s law states that for many types of naturally occurring data – including frequencies of word counts – the second most common word occurs half as often as the most common; the third most common occurs a third as often as the most popular; the fourth most common occurs a quarter as often; and so on. If we denote by f(r) the frequency of the item with rank r, this means that

$f(r) = C/r$

or

$r\times f(r)=C$,

where C is the constant f(1). And since this is true for every choice of r, the frequencies and ranks of the words ranked r and s are related by

$r\times f(r)=s \times f(s)$.

Then, assuming Zipf law applies,

$rank(\mbox{baseball'}) = rank(football') \times f(\mbox{football'})/f(\mbox{baseball'})$

$= 1543 \times 25271/28851 \approx 1352$

So, how accurate is this estimate? The database I extracted the data from is the well-known Brown University Standard Corpus of Present-Day American EnglishThe most common 5000 words in the database, together with their frequencies, can be found here. Searching down the list, you’ll find that the rank of ‘baseball’ is 1380, so the estimated value of 1352 is not that far out.

But where does Zipf’s law come from? It’s named after the linguist George Kingsley Zipf (1902-1950), who observed the law to hold empirically for words in different languages. Rather like Benford’s law, which we discussed in an earlier post, different arguments can be constructed that suggest Zipf’s law might be appropriate in certain contexts, but none is overwhelmingly convincing, and it’s really the body of empirical evidence that provides its strongest support.

Actually, Zipf’s law

$f(r) = C/r,$

is equivalent to saying that the frequency distribution follows a power law where the power is equal to -1. But many fits of the model to data can be improved by generalising this model to

$f(r)=C/r^k$

for some constant k. In this more general form the law has been shown to work well in many different contexts, including sizes of cities, website access counts, gene expression frequencies and strength of volcanic eruptions. The version with k=1 is found to work well for many datasets based on frequencies of word counts, but other datasets often require different values of k. But to use this more general version of the law we’d have to know the value of k, which we could estimate if we had sufficient amounts of data. The simpler Zipf’s law has k=1 implicitly, and so we were able to estimate the rank of ‘baseball’ with just the limited amount of information provided.

Finally, I had just 3 responses to the request for predictions of the rank of ‘baseball’: 1200, 1300 and 1450, each of which is entirely plausible. But if I regard each of these estimates as those of an expert and try combining those expert opinions by taking the average I get 1317, which is very close to the Zipf law prediction of 1352. Maybe if I’d had more replies the average would have been even closer to the Zipf law estimate or indeed to the true answer itself 😏.

Data controversies

Some time ago I wrote about Mendel’s law of genetic inheritance, and how statistical analysis of Mendel’s data suggested his results were too good to be true. It’s not that his theory is wrong; it’s just that the data he provided as evidence for his theory seem to have been manipulated in such a way as to seem incontrovertible. Unfortunately the data lack the variation that Mendel’s own law would also imply should occur in measurements of that type, leading to the charge that the data had been manufactured or manipulated in some way.

The photograph, taken 100 years ago, was as striking at that time as the recent picture of a black hole, discussed in an earlier post, is today. However, this picture was taken with basic photographic equipment and telescopic lens and shows a total solar eclipse, as the moon passes directly between the Earth and the Sun.

A full story of the controversy is given here.

In summary: Einstein’s theory of general relativity describes gravity not as a force between two attracting masses – as is central to Newtonian physics – but as a curvature caused in space-time due to the presence of massive objects. All objects cause such curvature, but only those that are especially massive, such as stars and planets, will have much of an effect.

Einstein’s relativity model was completely revolutionary compared to the prevailing view of physical laws at the time. But although it explained various astronomical observations that were anomalous according to Newtonian laws, it had never been used to predict anomalous behaviour. The picture above, and similar ones taken at around the same time, changed all that.

In essence, blocking out the sun’s rays enabled dimmer and more distant stars to be accurately photographed. Moreover, if Einstein’s theory were correct, the photographic position of these stars should be slightly distorted because of the spacetime curvature effects of the sun. But the effect is very slight, and even Newtonian physics suggests some disturbance due to gravitational effects.

In an attempt to get photographic evidence at the necessary resolution, the British astronomer Arthur Eddington set up two teams of scientists – one on the African island of Príncipe, the other in Sobral, Brazil – to take photographs of the solar eclipse on 29 May, 1919. Astronomical and photographic equipment was much more primitive in those days, so this was no mean feat.

Anyway, to cut a long story short, a combination of poor weather conditions and other setbacks meant that the results were less reliable than were hoped for. It seems that the data collected at Príncipe, where Eddington himself was stationed, were inconclusive, falling somewhere between the Newton and Einstein model predictions. The data at Sobral were taken with two different types of telescope, with one set favouring the Newton view and the other Einstein’s. Eddington essentially combined the Einstein-favouring data from Sobral together with those from Príncipe and concluded that the evidence supported Einsteins relativistic model of the universe.

Now, in hindsight, with vast amounts of empirical evidence of many types, we know Einstein’s model to be fundamentally correct. But did Eddington selectively choose his data to support Einstein’s model?

There are different points of view, which hinge on Eddington’s motivation for dropping a subset of the Sobral data from his analysis. One point of view is that he wanted Einstein’s view to be correct, and therefore simply ignored the data that were less favourable. This argument is fuelled by political reasoning: it sarges that since Eddington was a Quaker, and therefore a pacifist, he wanted to support a German theory as a kind of post-war reconciliation.

The alternative point of view, for which there is some documentary evidence, is that the Sobral data which Eddington ignored had been independently designated as unreliable. Therefore, on proper scientific grounds, Eddington had behaved entirely correctly by excluding it from his analysis, and his subsequent conclusions favouring the Einstein model were entirely consistent with the scientific data and information he had available.

This issue will probably never be fully resolved, though in a recent review of several books on the matter, theoretical physicist Peter Coles (no relation) claims to have reanalysed the data given in the Eddington paper using modern statistical methods, and found no reason to doubt his integrity. I have no reason to doubt that point of view, but there’s no detail of the statistical analysis that was carried out.

What’s interesting though, from a statistical point of view, is how the interpretation of the results depends on the reason for the exclusion of a subset of the Sobral data. If your view is that Eddington knew their contents and excluded them on that basis, then his conclusions in favour of Einstein must be regarded as biased. If you accept that Eddington excluded these data a priori because of their unreliability, then his conclusions were fair and accurate.

Data are often treated as a neutral aspect of an analysis. But as this story illustrates, the choice of which data to include or exclude, and the reasons for doing so, may be factors which fundamentally alter the direction an analysis will take, and the conclusions it will reach.

Word rank

I recently came across a large database of the use of English-American words. It aims to provide a representative sample of the usage English-American by including the words extracted from a large number of English texts of different types – books, newspaper articles, magazines etc. In total it includes around 560 million words collected over the years 1990-2017.

The word ‘football’ occurs in the database 25,271 times and has rank 1543. In principle, this means that ‘football’ was the 1543rd most frequent word in the database, though the method used for ranking the database elements is a little more complicated than that, since it attempts to combine a measure of both the number of times the word appears and the number of texts it appears in. Let’s leave that subtlety aside though and assume that ‘football’, with a frequency of 25,271, is the 1543rd most common word in the database.

The word ‘baseball’ occurs in the same database 28,851 times. With just this information, what would you predict the rank of the word ‘baseball’ to be? For example, if you think ‘baseball’ is the most common word, it would have rank 1. (It isn’t: ‘the’ is the most common word). If you think ‘baseball’ would be the 1000th most common word, your answer would be 1000.

Give it a little thought, but don’t waste time on it. I really just want to use the problem as an introduction to an issue that I’ll discuss in a future post. I’d be happy to receive your answer though, together with an explanation if you like, by mail. Or if you’d just like to fire an answer anonymously at me, without explanation, you can do so using this survey form.

Revel in the amazement

In an earlier post I included the following table:

As I explained, one of the columns contains the genuine land areas of each country, while the other is fake. And I asked you which is which.

The answer is that the first column is genuine and the second is fake. But without a good knowledge of geography, how could you possibly come to that conclusion?

Well, here’s a remarkable thing. Suppose we take just the leading digit of each  of the values. Column 1 would give 6, 2, 2, 1,… for the first few countries, while column 2 would give 7, 9, 3, 3,… It turns out that for many naturally occurring phenomena, you’d expect the leading digit to be 1 on around 30% of occasions. So if the actual proportion is a long way from that value, then it’s likely that the data have been manufactured or manipulated.

Looking at column 1 in the table, 5 out of the 20 countries have a population with leading digit 1; that’s 25%. In column 2, none do; that’s 0%. Even 25% is a little on the low side, but close enough to be consistent with 30% once you allow for discrepancies due to random variations in small samples. But 0% is pretty implausible. Consequently, column 1 is consistent with the 30% rule, while column 2 is not, and we’d conclude – correctly – that column 2 is faking it.

But where does this 30% rule come from? You might have reasoned that each of the digits 1 to 9 were equally likely – assuming we drop leading zeros – and so the percentage would be around 11% for a leading digit of 1, just as it would be for any of the other digits. Yet that reasoning turns out to be misplaced, and the true value is around 30%.

This phenomenon is a special case of something called Benford’s law, named after the physicist Frank Benford who first formalised it. (Though it had also been noted much earlier by the astronomer Simon Newcomb). Benford’s law states that for many naturally occurring datasets, the probability that the leading digit of a data item is 1 is equal to 30.1%. Actually, Benford’s law goes further than that, and gives the percentage of times you’d get a 2 or a 3 or any of the digits 1-9 as the leading digit. These percentages are shown in the following table.

Leading Digit 1 2 3 4 5 6 7 8 9
Frequency 30.1% 17.6% 12.5% 9.7% 7.9% 6.7% 5.8% 5.1% 4.6%

For those of you who care about such things, these percentages are log(2/1), log(3/2), log(4/3) and so on up to log(10/9), where log here is logarithm with respect to base 10.

But does Benford’s law hold up in practice? Well, not always, as I’ll discuss below. But often it does. For example, I took a dataset giving the altitudes of a large set of football stadiums around the world. I discarded a few whose altitude is below sea level, but was still left with over 13,000 records. I then extracted the leading digit of each of the altitudes (in metres)  and plotted a histogram of these values. This is just a plot of the percentages of occasions each value occurred. These are the blue bars in the following diagram. I then superimposed the predicted proportions from Benford’s law. These are the black dots.

The agreement between the observed percentages and those predicted by Benford’s law is remarkable. In particular, the observed percentage of leading digits equal to 1 is almost exactly what Benford’s law would imply. I promise I haven’t cheated with the numbers.

As further examples, there are many series of mathematically generated numbers for which Benford’s law holds exactly.

These include:

• The Fibonacci series: 1, 1, 2, 3, 5, 8, 13, …. where each number is obtained by summing the 2 previous numbers in the series.
• The integer powers of two: 1, 2, 4, 8, 16, 32, …..
• The iterative series obtained by starting with any number and successively multiplying by 3. For example, starting with 7, we get: 7, 21, 63, 189,….

In each of these cases of infinite series of numbers, exactly 30.1% will have leading digit equal to 1; exactly 17.6% will have leading digit equal to 2, and so on.

And there are many other published examples of data fitting Benford’s law (here, here, here… and so on.)

Ok, at this point you should pause to revel in the amazement of this stuff. Sometimes mathematics, Statistics and probability come together in a way to explain naturally occurring phenomena that is so surprising and shockingly elegant it takes your breath away.

So, when does Benford’s law work. And why?

It turns out there are various ways of explaining Benford’s law, but none of them – at least as far as I can tell – is entirely satisfactory. All of them require a leap of faith somewhere to match the theory to real-life. This view is similarly expressed in an academic article, which concludes:

… there is currently no unified approach that simultaneously explains (Benford’s law’s) appearance in dynamical systems, number theory, statistics, and real-world data.

Despite this, the various arguments used to explain Benford’s law do give some insight into why it might arise naturally in different contexts:

1. If there is a law of this type, Benford’s law is the only one that works for all choices of scale. The decimal representation of numbers is entirely arbitrary, presumably deriving from the fact that humans, generally, have 10 fingers. But if we’d been born with 8 fingers, or chosen to represent numbers anyway in binary, or base 17, or something else, you’d expect a universal law to be equally valid, and not dependent on the arbitrary choice of counting system. If this is so, then it turns out that Benford’s law, adapted in the obvious way to the choice of scale, is the only one that could possibly hold. An informal argument as to why this should be so can be found here.
2. If the logarithm of the variable under study has a distribution that is smooth and roughly symmetric – like the bell-shaped normal curve, for example – and is also reasonably well spread out, it’s easy to show that Benford’s law should hold approximately. Technically, for those of you who are interested, if X is the thing we’re measuring, and if log X has something like a normal distribution with a variance that’s not too small, then Benford’s law is a good approximation for the behaviour of X. A fairly readable development of the argument is given here. (Incidentally, I stole the land area of countries example directly from this reference.)

But in the first case, there’s no explanation as to why there should be a universal law, and indeed many phenomena – both theoretical and in nature – don’t follow Benford’s law. And in the second case, except for special situations where the normal distribution has some kind of theoretical justification as an approximation, there’s no particular reason why the logarithm of the observations should behave in the required way. And yet, in very many cases – like the land area of countries or the altitude of football stadiums – the law can be shown empirically to be a very good approximation to the truth.

One thing which does emerge from these theoretical explanations is a better understanding of when Benford’s law is likely to apply and when it’s not. In particular, the argument only works when the logarithm of the variable under study is reasonably well spread out. What that means in practice is that the variable itself needs to cover several orders of magnitude: tens, hundreds, thousands etc. This works fine for something like the stadium altitudes, which vary from close to sea-level up to around 4,000 metres, but wouldn’t work for total goals in football matches, which are almost always in the range 0 to 10, for example.

So, there are different ways of theoretically justifying Benford’s law, and empirically it seems to be very accurate for different datasets which cover orders of magnitude. But does it have any practical uses? Well, yes: applications of Benford’s law have been made in many different fields, including…

Finally, there’s also a version of Benford’s law for the second digit, third digit and so on. There’s an explanation of this extension in the Wikipedia link that I gave above. It’s probably not easy to guess exactly what the law might be in these cases, but you might try and guess how the broad pattern of the law changes as you move from the first to the second and to further digits.

Thanks to those of you wrote to me after I made the original post. I don’t think it was easy to guess what the solution was, and indeed if I was guessing myself, I think I’d have been looking for a uniformity in the distribution of the digits, which turns out to be completely incorrect, at least for the leading digit. Even though I’ve now researched the answer myself, and made some sense of it, I still find it rather shocking that the law works so well for an arbitrary dataset like the stadium altitudes. Like I say: revel in the amazement.