Woodland creatures

The Hedgehog and the Fox is an essay by the philosopher Isaiah Berlin. Though published in 1953, its title refers to a fragment of a poem by the ancient Greek poet Archilochus. The relevant passage translates as:

… a fox knows many things, but a hedgehog one important thing.

Isaiah Berlin used this concept to classify famous thinkers: those whose ideas could be summarised by a single principle are hedgehogs; those whose ideas are more pragmatic, multi-faceted and evolving are foxes.

This dichotomy of approaches to thinking has more recently been applied in the context of prediction, and is the basis of the following short (less than 5-minute) video, kindly suggested to me by Richard.Greene@Smartodds.co.uk.

Watch and enjoy…

So, remarkably, in a study of the accuracy of individuals when making predictions, nothing made a difference: age, sex, political outlook… except one thing: ‘foxes’ are better predictors than ‘hedgehogs’. Being well-versed in a single consistent philosophy is inferior to an adaptive and evolving approach to knowledge and its application.

The narrator, David Spiegelhalter, also summarises the strengths of a good forecaster as:

  1. Aggregation. They use multiple sources of information, are open to new knowledge and are happy to work in teams.
  2. Metacognition. They have an insight into how they think and the biases they might have, such as seeking evidence that simply confirms pre-set ideas.
  3. Humility. They have a willingness to acknowledge uncertainty, admit errors and change their minds. Rather than saying categorically what is going to happen, they are only prepared to give probabilities of future events.

(Could almost be a bible for a sports modelling company.)

These principles are taken from the book Future Babble by Dan Gardner, which looks like it’s a great read. The tagline for the book is ‘how to stop worrying and love the unpredictable’, which on its own is worth the cost of the book.


Incidentally, I could just as easily have written a blog entry about David Spiegelhalter as part of my series on famous statisticians. Until recently he was the president of the Royal Statistical Society. He was also knighted in 2014 for his services to Statistics, and has received numerous awards and honorary degrees.

His contributions to statistics are many, especially in the field of Medical Statistics. Equally though, as you can tell from the above video, he is a fantastic communicator of statistical ideas. He also has a recent book out: The Art of Statistics: Learning from Data. I’d guess that if anyone wants to learn something about Statistics from a single book, this would be the place to go. I’ve just bought it, but haven’t read it yet. Once I do, if it seems appropriate, I’ll post a review to the blog.

Tennis puzzles

They’re not especially statistical, and not especially difficult, but I thought you might like to try some tennis-based puzzle questions. I’ve mentioned before that Alex Bellos has a fortnightly column in the Guardian where he presents mathematical puzzles of one sort or another. Well, to coincide with the opening of Wimbledon, today’s puzzles have a tennis-based theme. You can find them here.

I think they’re fairly straightforward, but in any case, Alex will be posting the solutions later today if you want to check your own answers.

I say they’re not especially statistical, but there is quite a lot of slightly intricate probability associated with tennis, since live tennis betting is a lucrative market these days. Deciding whether a bet is good value or not means taking the current score and an estimate of the players’ relative abilities, and converting that into a match win probability for either player, which can then be compared against the bookmakers’ odds. But how is that done? The calculations are reasonably elementary, but complicated by both the scoring system and the fact that players tend to be more likely to win a point on serve than return.
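To give a flavour of the sort of calculation involved, here’s a minimal sketch (my own illustration, not taken from any betting model): the probability that the server wins a single game, given a probability p of winning each point on serve, and assuming points are independent.

```python
from functools import lru_cache

def game_win_prob(p):
    """Probability the server wins a game, given probability p of winning
    each point on serve, assuming points are independent."""
    @lru_cache(maxsize=None)
    def win(a, b):          # a = points won by the server, b = by the returner
        if a >= 4 and a - b >= 2:
            return 1.0      # server has won the game
        if b >= 4 and b - a >= 2:
            return 0.0      # returner has won the game
        if a == b and a >= 3:
            # deuce: the recursion has the closed form p^2 / (p^2 + q^2)
            return p * p / (p * p + (1 - p) * (1 - p))
        return p * win(a + 1, b) + (1 - p) * win(a, b + 1)
    return win(0, 0)

print(game_win_prob(0.65))  # winning 65% of service points gives roughly 83% of service games
```

Games combine into sets, and sets into matches, in exactly the same recursive way, which is essentially what the paper linked below works through for all score situations.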

If you’re interested, the relevant calculations for all score situations are available in this academic paper, though this assumes players are equally strong on serve and return. It also assumes the outcome of each point is statistically independent of all other points – that’s to say, knowing the outcome of one point doesn’t affect the probability of who wins another point. So, to Alex’s 3 questions above, I might add a fourth:

Why might tennis points not be statistically independent in practice, and what is the likely effect on match probability calculations of assuming they are when they’re not?

Walking on water

Here’s a question: how do you get dogs to walk on water?

Turns out there’s a really simple answer – just heat the atmosphere up by burning fossil fuels so much that the Greenland ice sheet melts.

The remarkable picture above was taken by a member of the Centre for Ocean and Ice at the Danish Meteorological Institute. Their pre-summer retrieval of research equipment is normally a sledge ride across a frozen winter wasteland; this year it was a paddle through the ocean that’s sitting on what’s left of the ice. And the husky dogs that pull the sledge are literally walking on water.

This graph shows the extent – please note: clever play on words – of the problem…

The blue curve shows the median percentage of Greenland ice melt over the last few decades. There’s natural year-to-year variation around that average, and as with any statistical analysis, it’s important to understand what types of variation are normal before deciding whether any particular observation is unusual or not. So, in this case, the dark grey area shows the range of values that were observed in 50% of years; the light grey area is the range that was observed in 90% of years. So, you’d only expect observations outside the light grey area once every ten years. Moreover, the further an observation falls outside the grey area, the more anomalous it is.

Now, look at the trace for 2019 shown in red. The value for June isn’t just outside the normal range of variation, it’s way outside. And it’s not only an unusually extreme observation for June; it would be extreme even for the hottest part of the year in July. At its worst (so far), the melt for June 2019 reached over 40%, whereas the average in mid-July is around 18%, with a value of about 35% being exceeded only once in every 10 years.

So, note how much information can be extracted from a single well-designed graph. We can see:

  1. The variation across the calendar of the average ice melt;
  2. The typical variation around the average – again across the calendar – in terms of an interval expected to contain the observed value on 50% of occasions: the so-called inter-quartile range;
  3. A more extreme measure of variation, showing the levels that are exceeded only once every 10 years: the so-called inter-decile range;
  4. The trace of an individual year – up to current date – which appears anomalous.

In particular, by showing us the variation in ice melt both within years and across years we were able to conclude that this year’s June value is truly anomalous.
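Just to make the construction of such a graph concrete, here’s a minimal sketch of how the bands and an anomaly check might be computed, with made-up numbers standing in for the real melt data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for the real data: rows are past years, columns are days
# of the year, values are the percentage of the ice sheet showing melt.
history = np.clip(rng.gamma(2.0, 4.0, size=(30, 365)), 0, 100)

median = np.percentile(history, 50, axis=0)            # the blue curve
q25, q75 = np.percentile(history, [25, 75], axis=0)    # dark grey band (inter-quartile range)
q10, q90 = np.percentile(history, [10, 90], axis=0)    # light grey band (inter-decile range)

# Trace for the current year: flag days that fall outside the light grey band
this_year = np.clip(rng.gamma(2.5, 5.0, size=365), 0, 100)
anomalous = np.where((this_year < q10) | (this_year > q90))[0]
print(f"{anomalous.size} days fall outside the once-in-ten-years band")
```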

Now let’s look at another graph. These are average spring temperatures, not for Greenland but for Alaska, where there are similar concerns about ice melt caused by increased atmospheric temperatures.

alaska

Again, there’s a lot of information:

  1. Each dot is an average spring temperature, one per year;
  2. The dots have been coloured: most are black, but the blue and red ones correspond to the ten coldest and hottest years respectively;
  3. The green curve shows the overall trend;
  4. The value for 2019 has been individually identified.

And the picture is clear. Not only has the overall trend been increasing since around the mid-seventies, but almost all of the hottest years have occurred in that period, while almost none of the coldest have. In other words, the average spring temperature in Alaska has been increasing over the last 50 years or so, and is hotter now than it has been for at least 90 years (and probably much longer).

Now, you don’t need to be a genius in biophysics to understand the cause and effect relating temperature and ice. So the fact that extreme ice melts are occurring in the same period as extreme temperatures is hardly surprising. What’s maybe less well-known is that the impact of these changes has a knock-on effect way beyond the confines of the Arctic.

So, even if dogs walking on the water of the arctic oceans seems like a remote problem, it’s part of a chain of catastrophic effects that will soon affect our lives too. Statistics has an important role to play in determining and communicating the presence and cause of these effects, and the better we all are at understanding those statistics, the more likely we will be able to limit the damage that is already inevitable. Fortunately, our governments are well aware of this and are taking immediate actions to remedy the problem.

Oh, wait…

… scrap that, better take action ourselves.

First pick

Zion Williamson

If you follow basketball you’re likely to know that the NBA draft was held this weekend, resulting in wonderkid Zion Williamson being selected by the New Orleans Pelicans. The draft system is a procedure by which newly available players are distributed among the various NBA teams.

Unlike most professional team sports in Europe, the NBA uses the draft system as a partial attempt to balance out teams in terms of the quality of their players. Specifically, teams that do worse one season are given preference when choosing players for the next season. It’s a slightly archaic and complicated procedure – which is shorthand for saying I couldn’t understand all the details from Wikipedia – but the principles are simple enough.

There are 3 stages to the procedure:

  1. A draft lottery schedule, in which teams are given a probability of having first pick, second pick and so on, based on their league position in the previous season. Only teams below a certain level in the league are permitted to have the first pick,  and the probabilities allocated to each team are inversely related to their league position. In particular, the lowest placed teams have the highest probability of getting first pick.
  2. The draft lottery itself, held towards the end of May, where the order of pick selections is assigned randomly to the teams according to the probabilities assigned in the schedule.
  3. The draft selection, held in June, where teams make their picks in the order that they’ve been allocated in the lottery procedure.

In the 2019 draft lottery, the first pick probabilities were assigned as follows:

nbapick

So, the lowest-placed teams, New York, Cleveland and Phoenix, were all given a 14% chance, down to Charlotte, Miami and Sacramento who were given a 1% chance. The stars and other indicators in the table are an additional complication arising from the fact that teams can trade their place in the draw from one season to another.

In the event, following the lottery based on these probabilities, the first three picks were given to New Orleans, Memphis and New York respectively. The final stage in the process was then carried out this weekend, resulting in the anticipated selection of Zion Williamson by the New Orleans Pelicans.

There are several interesting aspects to this whole process from a statistical point of view.

The first concerns the physical aspects of the draft lottery. Here’s an extract from the NBA’s own description of the procedure:

Fourteen ping-pong balls numbered 1 through 14 will be placed in a lottery machine. There are 1,001 possible combinations when four balls are drawn out of 14, without regard to their order of selection. Before the lottery, 1,000 of those 1,001 combinations will be assigned to the 14 participating lottery teams. The lottery machine is manufactured by the Smart Play Company, a leading manufacturer of state lottery machines throughout the United States. Smart Play also weighs, measures and certifies the ping-pong balls before the drawing.

The drawing process occurs in the following manner: All 14 balls are placed in the lottery machine and they are mixed for 20 seconds, and then the first ball is removed. The remaining balls are mixed in the lottery machine for another 10 seconds, and then the second ball is drawn. There is a 10-second mix, and then the third ball is drawn. There is a 10-second mix, and then the fourth ball is drawn. The team that has been assigned that combination will receive the No. 1 pick. The same process is repeated with the same ping-pong balls and lottery machine for the second through fourth picks.

If the same team comes up more than once, the result is discarded and another four-ball combination is selected. Also, if the one unassigned combination is drawn, the result is discarded and the balls are drawn again. The length of time the balls are mixed is monitored by a timekeeper who faces away from the machine and signals the machine operator after the appropriate amount of time has elapsed.

You probably don’t need me to explain how complicated this all is, compared to the couple of lines of code it would take to do the same thing electronically (something like the sketch below). Arguably, perhaps, seeing the lottery carried out with the physical presence of ping pong balls might stop people thinking the results had been fixed. Except it doesn’t. So, it’s all just for show. Why do things efficiently and electronically when you can add razzmatazz and generate high TV ratings? Watching a statistician produce the same draw in a couple of minutes on a laptop maybe just wouldn’t have the same appeal.
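For what it’s worth, here’s roughly what an electronic version might look like – a sketch of my own, obviously not the NBA’s actual code, with only the teams whose odds are quoted in this post included:

```python
import math
import random

# math.comb(14, 4) == 1001 four-ball combinations; 1000 of them are shared out
# among the 14 lottery teams in proportion to their first-pick probabilities.
# (Only the teams quoted in the post are listed; the rest would come from the table.)
odds = {"New York": 0.14, "Cleveland": 0.14, "Phoenix": 0.14,
        "Charlotte": 0.01, "Miami": 0.01, "Sacramento": 0.01}

def draw_lottery(odds, n_picks=4):
    """Assign the top picks by sampling teams in proportion to their odds,
    removing each winner before the next draw (the electronic equivalent of
    redrawing whenever an already-picked team's combination comes up again)."""
    remaining = dict(odds)
    order = []
    for _ in range(n_picks):
        teams, weights = zip(*remaining.items())
        pick = random.choices(teams, weights=weights)[0]
        order.append(pick)
        del remaining[pick]
    return order

print(math.comb(14, 4))   # 1001
print(draw_lottery(odds))
```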

Anyway, my real reason for including this topic in the blog is the following. In several previous posts I’ve mentioned the use of simulation as a statistical technique. Applications are varied, but in most cases simulation is used to generate many realisations from a probability model in order to get a picture of what real data are likely to look like if their random characteristics are somehow linked to that probability model. 

For example, in this post I simulated how many packs of Panini stickers would be needed to fill an album. Calculating the probabilities of the number of packs needed to complete an album is difficult, but the simulation of the process of completing an album is easy.

And in a couple of recent posts (here and here) we used simulation techniques to verify what seemed like an easy intuitive result. As it turned out, the simulated results were different from what the theory suggested, and a slightly deeper study of the problem showed that some care was needed in the way the data were simulated. But nonetheless, the principle of using simulations to investigate the expected outcomes of a random experiment was sound. In each case simulations were used to generate data from a process whose probabilities would have been practically impossible to calculate by other means.

Which brings me to this article, sent to me by Oliver.Cobb@smartodds.co.uk. On the day of the draft lottery, the masterminds at USA Today decided to run 100 simulations of the draft lottery to see which team would get the first pick. It’s mind-numbingly pointless. As Ollie brilliantly put it:

You have to admire the way they’ve based an article on taking a known chance of something happening and using just 100 simulations to generate a less reliable figure than the one they started with.

In case you’re interested, and can’t be bothered with the article, Chicago got selected for first pick most often – 19 times – in the 100 USA Today simulations, and were therefore ‘predicted’ to win the lottery. But if they’d run their simulations many more times, it’s all but guaranteed that Chicago wouldn’t have come out on top; they would simply have been allocated first pick on close to the 12.5% of occasions corresponding to their probability in the table above. With enough simulations, the simulated lottery would almost always be topped by one of New York, Cleveland or Phoenix, whose proportions would be separated only by small amounts due to random variation.
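Ollie’s point in a few lines of code: simulating a known probability only ever gives you back a noisier version of the number you started with, and 100 simulations give a very noisy version indeed (my own sketch, using Chicago’s 12.5% first-pick probability):

```python
import random

# Chicago's assigned first-pick probability was 12.5%. Estimating a known
# probability by simulation just adds noise; more simulations = less noise.
p_chicago = 0.125
for n_sims in (100, 10_000, 1_000_000):
    hits = sum(random.random() < p_chicago for _ in range(n_sims))
    print(f"{n_sims:>9} simulations: estimated probability {hits / n_sims:.3f}")
```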

The only positive thing you can say about the USA Today article, is that at least they had the good sense not to do the simulation with 14 actual ping pong balls. As they say themselves:

So to celebrate one of the most cruel and unusual days in sports, we ran tankathon.com’s NBA draft lottery simulator 100 times to predict how tonight will play out. There’s no science behind this. We literally hit “sim lottery” 100 times and wrote down the results.

I especially like the “there’s no science behind this” bit.  Meantime, if you want to create your own approximation to a known set of probabilities, you too can hit the “sim lottery” button 100 times here.


Update: Benoit.Jottreau@Smartodds.co.uk pointed me at this article, which is relevant for two reasons. First, in terms of content. In previous versions of the lottery system, there was a stronger incentive in terms of probability assignments for teams to do badly in the league. This led to teams ‘tanking’: deliberately throwing games towards the end of a season when they knew they were unlikely to reach the playoffs, thereby improving their chances of getting a better player in the draft for the following season. The 2019 version of the lottery aims to reduce this effect, by giving teams less of an incentive to be particularly poor. For example, the lowest three teams in the league now share the highest probability of first pick in the draft, whereas previously the lowest team had a higher probability than all others. But the article Benoit sent me suggests that the changes are unlikely to have much of an impact. It concludes:

…it seems that teams that want to tank still have strong incentives to tank, even if the restructured NBA draft lottery makes it less likely for them to receive the best picks.

The other reason why this article is relevant is that it makes much more intelligent use of simulation as a technique than the USA Today article referred to above.

Revel in the amazement

In an earlier post I included the following table:

As I explained, one of the columns contains the genuine land areas of each country, while the other is fake. And I asked you which is which.

The answer is that the first column is genuine and the second is fake. But without a good knowledge of geography, how could you possibly come to that conclusion?

Well, here’s a remarkable thing. Suppose we take just the leading digit of each  of the values. Column 1 would give 6, 2, 2, 1,… for the first few countries, while column 2 would give 7, 9, 3, 3,… It turns out that for many naturally occurring phenomena, you’d expect the leading digit to be 1 on around 30% of occasions. So if the actual proportion is a long way from that value, then it’s likely that the data have been manufactured or manipulated.

Looking at column 1 in the table, 5 out of the 20 countries have a land area with leading digit 1; that’s 25%. In column 2, none do; that’s 0%. Even 25% is a little on the low side, but close enough to be consistent with 30% once you allow for discrepancies due to random variation in small samples. But 0% is pretty implausible. Consequently, column 1 is consistent with the 30% rule, while column 2 is not, and we’d conclude – correctly – that column 2 is faking it.

But where does this 30% rule come from? You might have reasoned that each of the digits 1 to 9 is equally likely to lead – assuming we drop leading zeros – and so the percentage would be around 11% for a leading digit of 1, just as it would be for any of the other digits. Yet that reasoning turns out to be misplaced, and the true value is around 30%.

This phenomenon is a special case of something called Benford’s law, named after the physicist Frank Benford who first formalised it. (Though it had also been noted much earlier by the astronomer Simon Newcomb). Benford’s law states that for many naturally occurring datasets, the probability that the leading digit of a data item is 1 is equal to 30.1%. Actually, Benford’s law goes further than that, and gives the percentage of times you’d get a 2 or a 3 or any of the digits 1-9 as the leading digit. These percentages are shown in the following table.

Leading digit:   1      2      3      4     5     6     7     8     9
Frequency:     30.1%  17.6%  12.5%  9.7%  7.9%  6.7%  5.8%  5.1%  4.6%

For those of you who care about such things, these percentages are log(2/1), log(3/2), log(4/3) and so on up to log(10/9), where log here is logarithm with respect to base 10.
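In code, the whole table is one line of log10s:

```python
import math

# Benford frequencies: log10((d + 1) / d) for each possible leading digit d
for d in range(1, 10):
    print(d, f"{math.log10((d + 1) / d):.1%}")
```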

But does Benford’s law hold up in practice? Well, not always, as I’ll discuss below. But often it does. For example, I took a dataset giving the altitudes of a large set of football stadiums around the world. I discarded a few whose altitude is below sea level, but was still left with over 13,000 records. I then extracted the leading digit of each of the altitudes (in metres)  and plotted a histogram of these values. This is just a plot of the percentages of occasions each value occurred. These are the blue bars in the following diagram. I then superimposed the predicted proportions from Benford’s law. These are the black dots.

 

The agreement between the observed percentages and those predicted by Benford’s law is remarkable. In particular, the observed percentage of leading digits equal to 1 is almost exactly what Benford’s law would imply. I promise I haven’t cheated with the numbers.
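If you want to run the same check on a dataset of your own, the recipe is short. Here’s a sketch, with a small placeholder list standing in for the 13,000 or so altitudes:

```python
from collections import Counter
import math

def leading_digit(x):
    """First significant (non-zero) digit of a positive number."""
    digits = f"{x:.15g}".replace(".", "").lstrip("0")
    return int(digits[0])

# Placeholder data; the real check would use the ~13,000 stadium altitudes in metres.
altitudes = [4, 12.5, 35, 120, 670, 1200, 2240, 3600]

counts = Counter(leading_digit(a) for a in altitudes)
for d in range(1, 10):
    observed = counts.get(d, 0) / len(altitudes)
    print(d, f"observed {observed:.1%}, Benford {math.log10((d + 1) / d):.1%}")
```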

As further examples, there are many series of mathematically generated numbers for which Benford’s law holds exactly.

These include:

  • The Fibonacci series: 1, 1, 2, 3, 5, 8, 13, …. where each number is obtained by summing the 2 previous numbers in the series.
  • The integer powers of two: 1, 2, 4, 8, 16, 32, …..
  • The iterative series obtained by starting with any number and successively multiplying by 3. For example, starting with 7, we get: 7, 21, 63, 189,….

In each of these cases of infinite series of numbers, exactly 30.1% will have leading digit equal to 1; exactly 17.6% will have leading digit equal to 2, and so on.
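It’s easy to check this numerically for, say, the powers of two (the match is only exact in the limit, but it’s already close after a few thousand terms):

```python
import math
from collections import Counter

# Leading digits of the first 5,000 powers of two versus Benford's law
powers = [2 ** n for n in range(1, 5001)]
counts = Counter(int(str(p)[0]) for p in powers)
for d in range(1, 10):
    print(d, f"observed {counts[d] / len(powers):.1%}, "
             f"Benford {math.log10((d + 1) / d):.1%}")
```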

And there are many other published examples of data fitting Benford’s law (here, here, here… and so on.)

Ok, at this point you should pause to revel in the amazement of this stuff. Sometimes mathematics, Statistics and probability come together to explain naturally occurring phenomena in a way that is so surprising and shockingly elegant it takes your breath away.

So, when does Benford’s law work? And why?

It turns out there are various ways of explaining Benford’s law, but none of them – at least as far as I can tell – is entirely satisfactory. All of them require a leap of faith somewhere to match the theory to real-life. This view is similarly expressed in an academic article, which concludes:

… there is currently no unified approach that simultaneously explains (Benford’s law’s) appearance in dynamical systems, number theory, statistics, and real-world data.

Despite this, the various arguments used to explain Benford’s law do give some insight into why it might arise naturally in different contexts:

  1. If there is a law of this type, Benford’s law is the only one that works for all choices of scale. The decimal representation of numbers is entirely arbitrary, presumably deriving from the fact that humans, generally, have 10 fingers. But if we’d been born with 8 fingers, or chosen to represent numbers anyway in binary, or base 17, or something else, you’d expect a universal law to be equally valid, and not dependent on the arbitrary choice of counting system. If this is so, then it turns out that Benford’s law, adapted in the obvious way to the choice of scale, is the only one that could possibly hold. An informal argument as to why this should be so can be found here.
  2. If the logarithm of the variable under study has a distribution that is smooth and roughly symmetric – like the bell-shaped normal curve, for example – and is also reasonably well spread out, it’s easy to show that Benford’s law should hold approximately. Technically, for those of you who are interested, if X is the thing we’re measuring, and if log X has something like a normal distribution with a variance that’s not too small, then Benford’s law is a good approximation for the behaviour of X. A fairly readable development of the argument is given here. (Incidentally, I stole the land area of countries example directly from this reference.)

But in the first case, there’s no explanation as to why there should be a universal law, and indeed many phenomena – both theoretical and in nature – don’t follow Benford’s law. And in the second case, except for special situations where the normal distribution has some kind of theoretical justification as an approximation, there’s no particular reason why the logarithm of the observations should behave in the required way. And yet, in very many cases – like the land area of countries or the altitude of football stadiums – the law can be shown empirically to be a very good approximation to the truth.

One thing which does emerge from these theoretical explanations is a better understanding of when Benford’s law is likely to apply and when it’s not. In particular, the argument only works when the logarithm of the variable under study is reasonably well spread out. What that means in practice is that the variable itself needs to cover several orders of magnitude: tens, hundreds, thousands etc. This works fine for something like the stadium altitudes, which vary from close to sea-level up to around 4,000 metres, but wouldn’t work for total goals in football matches, which are almost always in the range 0 to 10, for example.

So, there are different ways of theoretically justifying Benford’s law, and empirically it seems to be very accurate for many different datasets which cover several orders of magnitude. But does it have any practical uses? Well, yes: applications of Benford’s law have been made in many different fields, including…

Finally, there’s also a version of Benford’s law for the second digit, third digit and so on. There’s an explanation of this extension in the Wikipedia link that I gave above. It’s probably not easy to guess exactly what the law might be in these cases, but you might try and guess how the broad pattern of the law changes as you move from the first to the second and to further digits.


Thanks to those of you who wrote to me after I made the original post. I don’t think it was easy to guess what the solution was, and indeed if I were guessing myself, I think I’d have been looking for uniformity in the distribution of the digits, which turns out to be completely incorrect, at least for the leading digit. Even though I’ve now researched the answer myself, and made some sense of it, I still find it rather shocking that the law works so well for an arbitrary dataset like the stadium altitudes. Like I say: revel in the amazement.

Statty night

Apologies for the terrible pun in the title.

When I used to teach Statistics I tried to emphasise to students that Statistics is as much an art as a science. Statisticians are generally trying to make sense of some aspect of the world, and they usually have just some noisy data with which to try to do it. Sure, there are algorithms and computer packages they can chuck data into and get simple answers out of. But usually those answers are meaningless unless the algorithm/package is properly tailored to the needs of the specific problem. And there are no rules as to how that is best done: it needs a good understanding of the problem itself, an awareness of the data that are available and the creative skill to be able to mesh those things with appropriate statistical tools. And these are skills that are closer to the mindset of an artist than of a scientist.

But anyway… I recently came across the following picture which turns the tables, and uses Statistics to make art. (Or to destroy art, depending on your point of view). You probably recognise the picture at the head of this post as Van Gogh’s Starry Night, which is displayed at MOMA in New York.

By contrast, the picture below is a statistical reinterpretation of the original version of Starry Night, created by photographer Mario Klingemann through a combination of data visualisation and statistical summarisation techniques.

The Starry Night Pie Packed

As you can see, the original painting has been replaced by a collage of coloured circles, each roughly the same colour as the corresponding part of the original painting. But in closer detail, the circles have an interesting structure. Each is actually a pie chart whose slices correspond, in size and colour, to the proportions of colours in that region of the original picture.
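Something like the following would reproduce the basic idea, crudely, for a single region of an image. It’s just a sketch of mine (file name, region and palette size are all made up, and it’s certainly not Klingemann’s actual method):

```python
from collections import Counter
from PIL import Image

# Summarise the colour make-up of one region of an image as proportions,
# ready to feed a pie chart. File name and crop box are hypothetical.
region = Image.open("starry_night.jpg").crop((0, 0, 100, 100))
region = region.quantize(colors=6).convert("RGB")   # reduce to 6 dominant colours
counts = Counter(region.getdata())                  # pixel counts per colour
total = sum(counts.values())
for colour, n in counts.most_common():
    print(colour, f"{n / total:.0%}")
```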

Yes, pointless, but kind of fun nonetheless. You can find more examples of Klingemann’s statistically distorted classical artworks here.

In a similar vein… the diagram below, produced by artist Arthur Buxton, is actually a quiz. Each of the pie charts represents the proportions of the main colours in one of Van Gogh’s paintings. In other words, these pie charts represent the colour distributions over whole Van Gogh paintings, rather than just a small region of a picture, as in the image above. The quiz is to identify which Van Gogh painting each of the pie charts refers to.

You can find a short description of Arthur Buxton’s process in developing this picture here.

There’s just a small snag: I haven’t been able to locate the answers. My guess is that the pie chart in column 2 of row 2 corresponds to Starry Night. And the one immediately to the left of that is from the Sunflower series. But that’s pretty much exhausted my knowledge of the works of Van Gogh. Let me know if you can identify any of the others and I’ll add them to a list below.


On the basis of experience with jigsaw puzzles – hey, we’re all on a learning curve and you never know when acquired knowledge will be useful – Nity.Raj@Smartodds.co.uk reliably informs me that the third pie chart from the left on the bottom row will correspond to one of the paintings from Van Gogh’s series of Irises. Looking at this link which Nity gave me it seems entirely plausible.

On top of the world

I’ll be honest, usually I try to find a picture that fits in with the statistical message I’m trying to convey. But occasionally I see a picture and then look for a statistical angle to justify its inclusion in the blog. This is one of those occasions. I don’t know what your mental image of the top of Everest is like, but until now mine wasn’t something that resembled the queue for the showers at Glastonbury.

Anyway, you might have read that this congestion to reach the summit of Everest is becoming increasingly dangerous. In the best of circumstances the conditions are difficult, but climbers are now faced with a wait of several hours at very high altitude with often unpredictable weather. And this has contributed to a spate of recent deaths.

But what’s the statistical angle? Well, suppose you wanted to make the climb yourself. What precautions would you take? Obviously you’d get prepared physically and make sure you had the right equipment. But beyond that, it turns out that a statistical analysis of relevant data, as the following video shows, can both improve your chances of reaching the summit and minimise your chances of dying while doing so.

This video was made by Dr Melanie Windridge, and is one of a series she made under the project title “Summiting the Science of Everest”. Her aim was to explore the various scientific aspects associated with a climb of Everest, which she undertook in Spring 2018. And one of these aspects, as set out in the video, is the role of data analysis in planning. The various things to be learned from the data include:

  1. Climbing from the south Nepal side is less risky than from the north Tibet side. This is explained by the steeper summit on the south side making descent quicker in case of emergency.
  2. Men and women are equally successful at completing summits of Everest. They also have similar death rates.
  3. Age is a big factor: over forties are less likely to make the summit; over sixties have a much higher death rate.
  4. Most deaths occur in the icefall regions of the mountain.
  5. Many deaths occur during descent.
  6. Avalanches are a common cause of death. Though they are largely unpredictable, they are less frequent in Spring. Moreover, walking through the icefall regions early in the morning also reduces avalanche risk.
  7. The distribution of summit times for climbers who survive is centred around 9 a.m., whereas for those who subsequently die during the descent it’s around 2 p.m. In other words, it’s safest to aim to arrive at the summit relatively early in the morning.

Obviously, climbing Everest will never be risk free – the death rate of people making the summit is, by some counts, around 6.5%. But intelligent use of available data can help minimise the risks. Statistics, in this context, really can be a matter of life or death.

Having said that, although Dr Melanie seemed reassured that the rate of deaths of climbers is decreasing, here’s a graphical representation of the data showing that the actual number of deaths – as opposed to the rate of deaths – is generally increasing with occasional spikes.

Looking on the bright side of things though, Everest is a relatively safe mountain to climb: the death rate for climbers on Annapurna, also in the Himalayas, is around 33%!

In light of all this, if you prefer your climbs to the top of the world to be risk free, you might try scaling the Google face (though I recommend turning the sound off first):

While for less than the price of a couple of beers you can get a full-on VR experience as previewed below:

Finally, if you’re really interested in the statistics of climbing Everest, there’s a complete database of all attempted climbs available here.

Faking it

 

Take a look at the following table:

fake_data

 

It shows the total land area, in square kilometres, for various countries. Actually, it’s the first part of a longer alphabetical list of all countries and includes two columns of figures, each purporting to be the corresponding area of each country. But one of these columns contains the real areas and the other one is fake. Which is which?

Clearly, if your knowledge of geography is good enough that you know the land area of Belgium – or any of the other countries in the table – or whether Bahrain is bigger than Barbados, then you will know the answer. You could also cheat and check with Google. But you can answer the question, and be almost certain of being correct, without cheating and without knowing anything about geography. Indeed, I could have removed the first column giving the country names, and even not told you that the data correspond to land areas, and you should still have been able to tell me which column is real and which is fake.

So, which column is faking it? And how do you know?

I’ll write a follow-up post giving the answer and explanation sometime soon. Meantime, if you’d like to write to me giving your own version, I’d be happy to hear from you.

 

Freddy’s story: part 2

In a previous post I discussed a problem that Freddy.Teuma@smartodds.co.uk had written to me about. The problem was a simplified version of an issue sent to him by a friend, connected with a genetic algorithm for optimisation. Simply stated: you start with £100. You toss a coin and if it comes up tails you lose 25% of your current money, otherwise you gain 25%. You play this game over and over, always increasing or decreasing your current money by 25% on the basis of a coin toss. The issue is how much money you expect to have, on average, after 1000 rounds of this game.

As I explained in the original post, Freddy’s intuition was that the average should stay the same at each round. So even after 1000 (or more) rounds, you’d have an average of £100. But when Freddy simulated the process, he always got an amount close to £0, and so concluded his intuition must be wrong.

A couple of you wrote to give your own interpretations of this apparent conflict, and I’m really grateful for your participation. As it turns out, Freddy’s intuition was spot on, and his argument was pretty much a perfect mathematical proof. Let me make the argument just a little bit more precise.

Suppose after n rounds the amount of money you have is M. Then after n+1 rounds you will have (5/4)M if you get a Head and (3/4)M if you get a Tail. Since each of these outcomes is equally probable, the average amount of money after n+1 rounds is

\frac{ (3/4)M + (5/4)M}{2}= M

In other words, exactly as Freddy had suggested, the average amount of money doesn’t change from one round to the next. And since I started with £100, this will be the average amount of money after 1 round, 2 rounds and all the way through to 1000 rounds.

But if Freddy’s intuition was correct, doesn’t that mean the simulations must have been wrong?

Well, no. I checked Freddy’s code – a world first! – and it was perfect. Moreover, my own implementation displayed the same features as Freddy’s, as shown in the previous post: every simulation has the amount of money decreasing to zero long before 1000 rounds have been completed.

So what explains this contradiction between what we can prove theoretically and what we see in practice?

The following picture shows histograms of the money remaining after a certain number of rounds for each of 100,000 simulations. In the previous post I showed the individual graphs of just 16 simulations of the game; here we’re looking at a summary of 100,000 simulated games.

For example, after 2 rounds, there are only 3 possible outcomes: £56.25, £93.75 and £156.25. You might like to check why that should be so. Of these, £93.75 occurred most often in the simulations, while the other two occurred more or less equally often. You might also like to think why that should be so. Anyway, looking at the values, it seems plausible that the average is around £100, and indeed the actual average from the simulations is very close to that value. Not exact, because of random variation, but very close indeed.

After 5 rounds there are more possible outcomes, but you can still easily convince yourself that the average is £100, which it is. But once we get to 10 rounds, it starts to get more difficult. There’s a tendency for most of the simulated runs to give a value that’s less than £100, balanced by a relatively small number of observations that are quite a bit bigger than £100. Indeed, you can just about see that there are one or more values close to £1000 or so. What’s happening is that the simulated values are becoming much more asymmetric as the number of rounds increases. Most of the results will end up below £100 – though still positive, of course – but a few will end up being much bigger than £100. And the average remains at £100, exactly as the theory says it must.

After 100 rounds, things are becoming much more extreme. Most of the simulated results end up close to zero, but one simulation (in this case) gave a value of around £300,000. And again, once the values are averaged, the answer is very close to £100.

But how does this explain what we saw in the previous post? All of the simulations I showed, and all of those that Freddy looked at, and those his friend obtained, showed the amount of money left being essentially zero after 1000 rounds. Well, the histogram of results after 1000 rounds is a much, much more extreme case of the one shown above for 100 rounds. Almost all of the probability is very, very close to zero. But there’s a very small amount of probability spread out up to an extremely large value indeed, such that the overall average remains £100. So almost every time I do a simulation of the game, the amount of money I have is very, very close to zero. But very, very, very occasionally, I would simulate a game whose result was a huge amount of money, such that it would balance out all of those almost-zero results and give me an answer close to £100. But, such an event is so rare, it might take billions of billions of simulations to get it. And we certainly didn’t get it in the 16 simulated games that I showed in the previous post.

So, there is no contradiction at all between the theory and the simulations. It’s simply that when the number of rounds is very large, the very large results which could occur after 1000 rounds, and which ensure that the average balances out to £100, occur with such low probability that we are unlikely to simulate enough games to see them. We therefore see only the much more frequent games with low winnings, and calculate an average which underestimates the true value of £100.
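This is easy to see in a quick simulation sketch of the game (my own code, not Freddy’s): across many simulated games the mean stays near £100, up to simulation noise, while the median collapses towards zero as the rounds mount up. Push the number of rounds to 1000 and even the simulated mean drops well below £100, for exactly the reason described above: the rare, enormous outcomes that prop the average up are never sampled.

```python
import numpy as np

rng = np.random.default_rng(1)

def play(n_games, n_rounds, start=100.0):
    """Each round multiplies the current amount by 0.75 or 1.25 with equal probability."""
    factors = rng.choice([0.75, 1.25], size=(n_games, n_rounds))
    return start * factors.prod(axis=1)

for n_rounds in (2, 10, 100):
    money = play(100_000, n_rounds)
    print(f"{n_rounds:>3} rounds: mean £{money.mean():.2f}, median £{np.median(money):.2f}")
```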

There are a number of messages to be drawn from this story:

  1. Statistical problems often arise in the most surprising places.
  2. The strategy of problem simplification, solution through intuition, and verification through experimental results is a very useful one.
  3. Simulation is a great way to test models and hypotheses, but it has to be done with extreme care.
  4. And if there’s disagreement between your intuition and experimental results, it doesn’t necessarily imply either is wrong. It may be that the experimental process has complicated features that make results unreliable, even with a large number of simulations.

Thanks again to Freddy for the original problem and the discussions it led to.


To be really precise, there’s a bit of sleight-of-hand in the mathematical argument above. After the first round my expected – rather than actual – amount of money is £100. What I showed above is that the average money I have after any round is equal to the actual amount of money I have at the start of that round. But that’s not quite the same thing as showing it’s equal to the average amount of money I have at the start of the round.

But there’s a famous result in probability – sometimes called the law of iterated expectations – which lets me replace this actual amount at the start of the second round with the average amount, and the result stays the same. You can skip this if you’re not interested, but let me show you how it works.

At the start of the first round I have £100.

Because of the rules of the game, at the end of this round I’ll have either £75 or £125, each with probability 1/2.

In the first case, after the second round, I’ll end up with either £56.25 or £93.75, each with probability 1/2. And the average of these is £75.

In the second case, after the second round, I’ll end up with either £93.75 or £156.25, each with probability 1/2. And the average of these is £125.

And if I average these averages I get £100. This is the law of iterated expectations at work. I’d get exactly the same answer if I averaged the four possible 2-round outcomes: £56.25, £93.75 (twice) and £156.25.

Check:

\frac{56.25 + 93.75 + 93.75 + 156.25}{4} = 100

So, my average after the second round is equal to the average after the first which was equal to the initial £100.

The same argument also applies at any round: the average is equal to the average of the previous round. Which in turn was equal to the average of the previous round. And so on, telescoping all the way back to the initial value of £100.

So, despite the sleight-of-hand, the result is actually true, and this is precisely what Freddy had hypothesised. As explained above, his only ‘mistake’ was to observe that a small number of simulations suggested a quite different behaviour, and to assume that this meant his mathematical reasoning was wrong.

 

Midrange is dead

Kirk Goldsberry is the author of a new book on data analytics for the NBA. I haven’t read the book, but some of the graphical illustrations he’s used for its publicity are great examples of the way data visualisation techniques can give insights into how the way a sport is played evolves.

 

Press the start button in the graphic of the above tweet. I’m not sure exactly how the graphic and the data are mapped, but essentially the coloured hexagons show regions of the basketball court which are the most frequent locations for taking shots. The animation shows how this pattern has changed over the seasons.

As you probably know, most goals in basketball – excluding penalty shots – are awarded 2 points. But a shot that’s scored from outside a distance of 7.24m from the basket – the almost semi-circular outer-zone shown in the figure – scores 3 points. So, there are two ways to improve the number of points you are likely to score when shooting: first, you can get closer to the basket, so that the shot is easier; or second, you can shoot from outside the three-point line, so increasing the number of points obtained when you do score. That means there’s a zone in-between, where the shot is still relatively difficult because of the distance from the basket, but for which you only get 2 points when you do score. And what the animation above clearly shows is an increasing tendency over the seasons for players to avoid shooting from this zone. This is perhaps partly because of a greater understanding of the trade-off between difficulty and distance, and perhaps also because improved training techniques have led to a greater competency in 3-point shots.
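The trade-off is just expected points per shot – the probability of making the shot multiplied by its value – which is what the heatmap below effectively measures. With purely illustrative make-percentages (not Goldsberry’s numbers):

```python
# Illustrative make-probabilities only, not taken from the book or the heatmap.
shots = {
    "at the rim (2 pts)":    (0.62, 2),
    "midrange (2 pts)":      (0.40, 2),
    "three-pointer (3 pts)": (0.36, 3),
}
for name, (p_make, value) in shots.items():
    print(f"{name}: {p_make * value:.2f} expected points per shot")
```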

Evidence to support this reasoning is the following data heatmap diagram which shows the average number of points scored from shots taken at different locations on the court. The closer to red, the higher the average score per shot.

Again the picture makes things very clear: average points scored are highest when shooting from very close to the basket, or from outside of the 3-point line. Elsewhere the average is low. It’s circumstantial evidence, but the fact that this map of points scored is so similar to the current map of where players shoot from strongly suggests that players have evolved their style of play in order to shoot at the basket from positions which they know are more likely to generate the most points.

In summary, creative use of both static and animated graphical data representations provides great insights into the way basketball play has evolved, and why that evolution is likely to have occurred, given the 3-point shooting rule.


Thanks to Benoit.Jottreau@smartodds.co.uk for posting something along these lines on RocketChat.