# Here’s to all the money at the end of the world

I made the point in last week’s Valentine’s Day post that, although the emphasis of this blog is on the methodology of using Statistics to understand the world through the analysis of data, it’s often the case that statistics in themselves tell their own story. In this way we learnt that a good proportion of the population of the UK buy their pets presents for Valentine’s Day.

As if that wasn’t bad enough, I now have to report to you the statistical evidence for the fact that nature itself is dying. Or as the Guardian puts it:

‘Plummeting insect numbers threaten collapse of nature’

The statistical and scientific evidence now points to the fact that, at current rates of decline, all insects could be extinct by the end of the century. Admittedly, it’s probably not great science or statistics to extrapolate the current annual loss of 2.5% in that way, but nevertheless it gives you a picture of the way things are going. This projected elimination of insects would be, by some definitions, the sixth mass extinction event on earth. (Earlier versions wiped out the dinosaurs and so on.)

And before you go all Donald Trump, and say ‘bring it on: mosquito-free holidays’, you need to remember that life on earth is a complex ecological system in which the big things (including humans) are indirectly dependent on the little things (including insects) via complex bio-mechanisms for mutual survival. So if all the insects go, all the humans go too. And this is by the end of the century, remember.

Here’s First Dog on the Moon’s take on it:

So, yeah, let’s do our best to make money for our clients. But let’s also not forget that money only has value if we have a world to spend it in, and use Statistics and all other means at our disposal to fight for the survival of our planet and all the species that live on it.

# Famous statisticians: Sir Francis Galton

This is the second in a so-far very short series on famous statisticians from history. You may remember that the first in the series was on John Tukey. As I said at that time, rather than just include statisticians randomly in this series, I’m going to focus on those who have had an impact beyond the realm of just statistics.

With that in mind, this post is about Sir Francis Galton (1822-1911), an English statistician who did most of his work in the second half of the 19th century, around the time that Statistics was being born as a viable scientific discipline.

You may remember seeing Galton’s name recently. In a recent post on the bean machine, I mentioned that the device also goes under the name of ‘Galton board’. This is because Galton was the inventor of the machine, which he used to illustrate the Central Limit Theorem, as discussed in the earlier post. You may also remember an earlier post in which I discussed ‘regression to the mean’; Galton was also the first person to explore and describe this phenomenon, as well as the more general concept of correlation to describe the extent to which two random phenomena are connected.

It’s probably no coincidence that Galton was a half-cousin of Charles Darwin, since much of Galton’s pioneering work was on the way statistics could be used to understand genetic inheritance and human evolution. Indeed, he is the inventor of the term eugenics, which he coined during his attempts to understand the extent to which intelligence is inherited, rather than developed.

Galton is described in Wikipedia as:

• A statistician
• A progressive
• A polymath
• A sociologist
• A psychologist
• An anthropologist
• A eugenicist
• A tropical explorer
• A geographer
• An inventor
• A meteorologist
• A proto-geneticist
• A psychometrician

And you thought you were busy. Anyway, it’s fair to say that Galton falls in my category of statisticians who have done something interesting with their lives outside of Statistics.

His various contributions apart from those mentioned above include:

1. He invented the use of weather maps for popular use;
2. He wrote a book ‘The Art of Travel’ which offered practical travel advice to Victorians;
3. He was the first to propose the use of questionnaires as a means of data collection;
4. He conceived the notion of standard deviation as a way of summarising the variation in data;
5. He devised a technique called composite portraiture, which was an early version of Photoshop for making montages of photographic portraits;
6. He pretty much invented fingerprinting as a technique for identifying individuals.

In summary, many of the things Galton worked on or invented are still relevant today. And this is just as true for his non-statistical contributions as for his statistical ones. Of course, it’s an unfortunate historical footnote that his theory of eugenics – social engineering to improve biological characteristics in populations – was adopted and pushed to extremes in Nazi Germany, with unthinkable consequences.

In retrospect, it’s a pity he didn’t just stop once he’d invented the bean machine.

# Happy Valentine’s Day

Happy Valentine’s Day. In case you didn’t get any cards or gifts today, please know that Smartodds loves Statistics loves you.

Anyway, I thought it might be interesting to research some statistics about Valentine’s day, and found this article, from which I learned much more about the population of Britain than I was expecting to.

Here are some of the highlights:

1. A significant number of people spend money on their pets for Valentine’s Day. This number varies per generation, and is as high as 8.7% for millennials.
2. A slightly smaller, but still significant, number of people spend money on themselves for Valentine’s Day. Again, this trend is most prevalent among millennials, and is also more common for women than men.
3. 36.2% of people get unwanted gifts most years.
4. 19% of people plan to celebrate Valentine’s late in order to save money by buying cards and gifts once the prices have dropped.

I’m not sure which of these statistics I found to be the most shocking.

Most of the posts in this blog are about the way Statistics as a science can be used to investigate problems and interpret data. But sometimes, the statistics are fascinating in themselves, and don’t require any kind of mathematical sophistication to reveal the secrets that they contain.

Anyway, I have to run now to buy ~~myself~~ my girlfriend a gift.

Happy Valentine’s…

# Dance, dance, dance…

Ever thought: ‘I’m pretty sure I would fully understand Statistics, if only a modern dance company would illustrate the techniques for me’?

I hope you get the idea of what I’m trying to do with this blog by now. Fundamentally, Statistics is a very intuitive subject, but that intuition is often masked by technicalities, so that from the outside the subject can seem both boring and impenetrable. The aim of all of my posts is to try to show that neither of those things is true: Statistics is both fascinating and easily understandable. And in this way, whatever your connection to Smartodds, you’ll be better equipped to understand the statistical side of the company’s operations.

Of course, I’m not the only person to try to de-mystify Statistics, and there are many books, blogs and other learning aids with similar aims.

With this in mind, I recently came across a rather unusual set of resources for learning Statistics: a series of dance videos whose aim is to explain statistical concepts through movement. Probably my ‘favourite’ is this one, which deals with the notions of sampling and standard error. You might like to take a look…

I think it’s fair to say that the comments on these videos on YouTube are mixed. One person wrote:

This way it makes complicated things look simpler. Very informative and useful. Loved it. 🙂

While another said:

this makes simple things look complicated but thanks anyway

So, I guess it depends on your perspective. I think I’m on the side of the latter commenter though: I’m pretty sure that in 5 minutes I could give a much clearer and more entertaining explanation of the issues this film is trying to address than the film does itself. But maybe that’s not the point. Perhaps the point is that different things hook different people in, and while personally I can’t think of a much more complicated way of thinking about issues of sampling and measuring accuracy, the dance perspective seems to work for some people.

Anyway, if you think this might be the key to help you unlock some of the mysteries of Statistics, you can find the full series of four videos here, covering topics like correlation and standard deviation. Enjoy.

# The bean machine

Take a look at the following video…

It shows the operation of a mechanical device that is variously known as a bean machine, a quincunx or a Galton board. When the machine is flipped, a large number of small balls or beans fall through a funnel at the top of the device. Below the funnel is a layered grid of pegs. As each bean hits a peg it can fall left or right – with equal probability if the board is carefully made – down to the next layer, where it hits another peg and can again go left or right. This repeats for a number of layers, and the beans are then collected in groups, according to the position in which they land in the final layer. At the end you get a kind of physical histogram, where the height of each column of beans corresponds to the frequency with which the beans have fallen in that slot.

Remarkably, every time this experiment is repeated, the pattern of beans at the bottom is pretty much the same: it’s symmetric, high in the middle, low at the edges and has a kind of general bell-shape. In fact, the shape of this histogram will be a good approximation to the well-known normal distribution curve:

As you probably know, it turns out that the relative frequencies of many naturally occurring phenomena look very much like this normal curve: heights of plants, people’s IQ, brightness of stars… and indeed (with some slight imperfections) the differences in team points in sports like basketball.

Anyway, if you look at the bottom of the bean machine at the end of the video, you’ll see that the heights of the columns of beans – which themselves represent the frequency of beans falling in each position – resemble this same bell-shaped curve. And this will happen – with different small irregularities – every time the bean machine is re-started.

Obviously, just replaying the video will always lead to identical results, so you’ll have to take my word for it that the results are similar every time the machine is operated. There are some simulators available, but my feeling is you lose something by not seeing the actual physics of real-world beans falling into place. Take a look here if you’re interested, though I suggest you crank the size and speed buttons up to their maximum values first.

But why should it be that the bean machine, like many naturally occurring phenomena, leads to frequencies that closely match the normal curve?

Well, the final position of each bean is the result of several random steps in which the bean could go left or right. If we count +1 every time the bean goes right and -1 every time the bean goes left, then the final position is the sum of these random +/-1 outcomes. And it turns out that, under fairly general conditions, whenever you have a process that is the sum of several random experiments, the final distribution is bound to look like this bell-shaped normal curve.

This is a remarkable phenomenon. The trajectory of any individual bean is unpredictable. It could go way to the left, or way to the right, though it’s more likely that it will stay fairly central. Anything is possible, though some outcomes are more likely than others. However, while the trajectory of individual beans is unpredictable, the collective behaviour of several thousand beans is entirely predictable to a very high degree of accuracy: the frequencies within any individual range will match very closely the values predicted by the normal distribution curve. This is really what makes statistics tick. We can predict very well how a population will behave, even if we can’t predict how individuals will behave.
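If you’d rather see this in code than in beans, here is a minimal sketch of the +1/-1 idea described above. It is only an illustration, not the simulator linked earlier: the function name `drop_beans`, the choice of 12 layers and 5000 beans, and the random seed are all assumptions made for the example.

```python
import random
from collections import Counter


def drop_beans(n_beans=5000, n_layers=12, p_right=0.5, seed=0):
    """Simulate a bean machine: each bean takes n_layers steps of +1 or -1."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter()
    for _ in range(n_beans):
        # Final position is the sum of the +1 (right) and -1 (left) steps.
        position = sum(1 if rng.random() < p_right else -1
                       for _ in range(n_layers))
        counts[position] += 1
    return counts


counts = drop_beans()

# Print a rough text histogram: one '#' per 25 beans in each slot.
for position in sorted(counts):
    print(f"{position:+3d} {'#' * (counts[position] // 25)}")
```

Changing the seed changes the small irregularities, but the overall bell shape should look much the same from run to run.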

Even more remarkably, if the bean machine has enough layers of pegs, the eventual physical histogram of beans will still look like the normal distribution curve, even if the machine has some sort of bias. For example, suppose the beans were released, but that the machine wasn’t quite vertical, so that the beans had a higher tendency to go left, rather than right, when they hit a peg. In this case, as long as there were sufficiently many layers of pegs, the final spread of beans would still resemble the normal curve, albeit no longer centred at the middle of the board. You can try this in the simulator by moving the left/right button away from 50%.
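As a rough check of this claim, the sketch below compares the exact slot probabilities for a tilted board with the normal curve that has the same mean and spread. The numbers are assumptions chosen for illustration: 100 layers of pegs and a 40% chance of going right at each peg. The function names are likewise made up for this example.

```python
import math


def slot_probabilities(n_layers, p_right):
    """Exact probability of ending in each slot, indexed by number of right steps."""
    return [math.comb(n_layers, k) * p_right**k * (1 - p_right)**(n_layers - k)
            for k in range(n_layers + 1)]


def normal_approximation(n_layers, p_right):
    """Normal curve with the same mean and standard deviation, evaluated at each slot."""
    mean = n_layers * p_right
    sd = math.sqrt(n_layers * p_right * (1 - p_right))
    return [math.exp(-0.5 * ((k - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
            for k in range(n_layers + 1)]


exact = slot_probabilities(100, 0.4)
approx = normal_approximation(100, 0.4)

peak_slot = exact.index(max(exact))  # slot where the beans pile highest
largest_gap = max(abs(e - a) for e, a in zip(exact, approx))
print(f"peak at slot {peak_slot}, largest gap from normal curve {largest_gap:.5f}")
```

The peak moves away from the middle slot (slot 50 here), but the gap between the exact probabilities and the normal curve stays small in every slot.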

Technically, the bean machine is a physical illustration of a mathematical result generally termed the Central Limit Theorem. This states that in situations like those illustrated by the bean machine, where a phenomenon can be regarded as a sum of random experiments, then under general conditions the distribution of final results will look very much like the well-known bell-shaped normal curve.

It’s difficult to overstate the importance of this result – which is fundamental to almost all areas of statistical theory and practice – since it lets us handle probabilities in populations, even when we don’t know how individuals behave. And the beauty of the bean machine is that it demonstrates that the Central Limit Theorem is meaningful in the real physical world, and not just a mathematical artefact.

Can’t live without your own desktop bean machine? I have good news for you…

# Groundhog day

Fed up of the cold, snow and rain? Don’t worry, spring is forecast to be here earlier than usual. Two caveats though:

1. ‘Here’ is some unspecified region of the United States, and might not extend as far as the UK;
2. This prediction was made by a rodent.

Yes, Saturday (February 2nd) was Groundhog Day in the US. And since Punxsutawney Phil failed to see his shadow, spring is forecast to arrive early.

You probably know about Groundhog Day from the Bill Murray movie

… but it’s actually a real event. It’s celebrated in many locations of the US and Canada, though it’s the event in Punxsutawney, Pennsylvania, which has become the most famous, and around which the movie was based. As Wikipedia says:

The Groundhog Day ceremony held at Punxsutawney in western Pennsylvania, centering around a semi-mythical groundhog named Punxsutawney Phil, has become the most attended.

Semi-mythical, no less. If you’d like to know more about Punxsutawney Phil, there’s plenty of information at The Punxsutawney Groundhog Club website, including a dataset of his predictions. These include the entry from 1937 when Phil had an ‘unfortunate meeting with a skunk’. (And whoever said data analysis was boring?)

Anyway, the theory is that if, at 7.30 a.m. on the second of February, Phil the groundhog sees his shadow, there will be six more weeks of winter; if not, spring will arrive early. Now, it seems a little unlikely that a groundhog will have powers of meteorological prediction, but since the legend has persisted, and there is other evidence of animal behaviour serving as a weather predictor,  it seems reasonable to assess the evidence.

Disappointingly, Phil’s success rate is rather low. This article gives it as 39%. I’m not sure if it’s obvious or not, but the article also states (correctly) that if you were to guess randomly, by tossing a coin, say, then your expected chance of guessing correctly is 50%. The reason I say it might not be obvious, is because the chance of spring arriving early is unlikely to be 50%. It might be 40%, say. Yet, randomly guessing with a coin will still have a 50% expected success rate. As such, Phil is doing worse than someone who guesses at random, or via coin tossing.
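To see why, here is the calculation written out, as a sketch under the assumption that the coin is fair and tossed independently of the weather. The symbol $q$ is just a label introduced here for the (unknown) probability that spring arrives early:

$P(\text{correct}) = q \times \tfrac{1}{2} + (1-q) \times \tfrac{1}{2} = \tfrac{1}{2}$

The $q$ cancels, so the answer is 50% whether spring arrives early 40% of the time or any other proportion of the time.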

However, if Phil’s 39% success rate is a genuine measure of his predictive powers – rather than a reflection of the fact that his guesses are also random, and he’s just been a bit unlucky over the years – then he’s still a very useful companion for predictive purposes. You just need to take his predictions, and predict the opposite. That way you’ll have a 61% success rate – rather better than random guessing. Unfortunately, this means you will have to put up with another 6 weeks of winter.

Meantime, if you simply want more Groundhog Day statistics, you can fill your boots here.

And finally, if you think I’m wasting my time on this stuff, check out the Washington Post who have done a geo-spatial analysis of the whole of the United States to colour-map the regions in which Phil has been respectively more and less successful with his predictions over the years.

🤣

# The benefit of foresight

Ok, I’m going to be honest… I’m not really happy with this post. I keep deleting it and re-writing it, but can’t get it in a form where it eloquently says what I want it to say. (Insert your own <like all of your other posts> joke here).

I’m trying to say the following things:

1. Trading in sports – or any field – is about predicting what will happen in the future;
2. Data are a summary of the past. If the future behaves like the past, then the data are likely to be useful; if it doesn’t, they’re likely to be less useful;
3. There is often information about the way things are likely to change in the future that’s external to, and not included in, data;
4. This means that predictions for sports trading based on statistical procedures will always be improved by the inclusion of additional knowledge and information that is provided by experts.

That’s what the rest of this post is trying to say. Unfortunately, it’s an admission of a poor post that I’m having to tell you this in advance, rather than letting you draw these conclusions yourself.

Anyway…

It’s often said that ‘with the benefit of hindsight, things could have been done better’. But since hindsight isn’t available when trading on sports, the best we can do is make optimal use of foresight.

This season has been a record-breaker for the NFL. Among other tumbling records, at 1371, the number of touchdowns in the regular season is the largest in the league’s 99-year history.

Of course, random variation means records will be broken from time to time just by chance, but if this sudden increase in points was actually predictable, then bets placed on NFL would have been improved if they had taken this into account.

Naturally, as statisticians, our primary source of evidence is contained in data, and we aim to exploit basic patterns and trends in data to help make predictions for the future. But data are by definition a snapshot of the past, and the models we develop will only work well if the future behaves like the past. Admittedly, if changes have already occurred, these will be encapsulated in data, and can be extrapolated into predictive models for the future. But data do not, in themselves, describe mechanisms of change.  And it will always be essential to use additional sources of information and knowledge, not contained in data, to temper, inform and modify predictions from data-based statistical models.

With all that in mind, I found this article an interesting read. It provides a chronology of events connected to the NFL, all of which have contributed one way or another to the current attack-based tendency of play. The foresight to use this knowledge at the start of the season, to modify predictions to account for a likely increase in points due to a greater emphasis on attack, would almost certainly have led to better predictions than those provided by using data-based models only.

# Stickers

Last year’s Fifa© world cup Panini sticker album had spaces for 682 stickers. Stickers were sold in packs of 5, at a cost of 80 pence per pack. How much was it likely to cost to fill the whole album? Maybe have a guess at this before moving on.

Well, to get 682 stickers you need 137 packs, so the obvious (but wrong) answer is 137 times 80 pence, which is £109.60. It’s wrong, of course, because it doesn’t take into account duplicate stickers: as the album fills up, when you buy a new pack, it’s likely that at least some of the new stickers will be stickers that you’ve already collected. And the more stickers you’ve already collected, the more likely it is that a new pack will contain stickers that you’ve already got. So, you’re likely to need many more than 137 packs and spend much more than £109.60. But how much more?

It turns out (see below) that on average the number of packs needed can be calculated as

$(682/682 + 682/681 + 682/680 + \ldots + 682/1) /5 \approx 969$

where the “…” means “plus all the terms in-between”. So the next term in the sequence you have to add is 682/679 and then 682/678 and so on, all the way down to the final term in the sequence which is given as 682/1.

So the average cost of filling the album is around $969 \times 80$ pence, or £775. You can probably also guess how this calculation changes if the number of spaces in the album were different from 682 or if the number of stickers per pack were different from 5.
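For anyone who wants to check the arithmetic, or try a different album, the sum can be evaluated in a few lines. This is a minimal sketch; the function name `expected_packs` is made up for the example, and it makes the same simplifying assumptions as the formula (every sticker equally likely, duplicates allowed within a pack).

```python
def expected_packs(album_size, pack_size):
    """Average number of packs needed to fill an album of album_size stickers."""
    # Sum album_size/album_size + album_size/(album_size - 1) + ... + album_size/1.
    expected_stickers = sum(album_size / remaining
                            for remaining in range(1, album_size + 1))
    return expected_stickers / pack_size


packs = expected_packs(682, 5)
cost_pounds = packs * 0.80  # 80 pence per pack

print(round(packs), round(cost_pounds))
```

Changing the two arguments shows how the answer moves with the size of the album and the number of stickers per pack.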

Well, actually, there’s a small mistake in this calculation. Strictly speaking, when you buy packs of 5 stickers, none of the stickers in a pack will be duplicates among themselves. The above calculation ignores this fact, and assumes that duplicates could occur within packs. However, it turns out that doing the mathematics more carefully – which is quite a bit more complicated – leads to a not-very-different answer of £773. So, we might have simplified things in our calculation of £775, but we didn’t lose much in terms of accuracy.

Anyway, a question that’s just as interesting as the accuracy of the answer is what the value of £775 means in practice. Though it’s the average value that would be spent by many collectors in filling the album, the actual experience of any individual collector might be quite different from this. The mathematics is more complicated again in this case, but we can avoid the complexity by simulating the process. The figure below shows a histogram of the number of packs needed to fill the album in a simulation of 10,000 albums.

So, for example, I needed roughly 800 packs to complete the album in around 1500 of the simulated albums. Of course, the average number of packs needed turns out to be close to the theoretical average of 969. But although sometimes fewer than this number were needed, the asymmetry of the histogram means that on many occasions far more than the average number was needed. For example, on a significant number of occasions more than 1000 packs were needed; on several occasions more than 1500 packs were needed; and on a few occasions more than 2000 packs were needed (at a cost of over £1600!). By contrast, there were no occasions on which 500 packs were sufficient to complete the album. So, even though an average spend of £775 probably sounded like a lot of money to fill the album, any individual collector might need to spend as much as £1600 or more, while all collectors would have needed to spend at least £400.
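A simulation of this kind is easy to sketch. The version below is not the code behind the histogram above; it is an illustration under stated assumptions: all stickers are equally likely, the five stickers within a pack are all different, and only 100 albums are simulated (rather than 10,000) to keep the run time short. The function names and the seed are made up for the example.

```python
import random


def packs_to_fill(album_size, pack_size, rng):
    """Buy packs until every sticker has been collected; return the number of packs."""
    collected = set()
    packs = 0
    while len(collected) < album_size:
        # rng.sample draws pack_size distinct stickers, so no duplicates within a pack.
        collected.update(rng.sample(range(album_size), pack_size))
        packs += 1
    return packs


def simulate(n_albums=100, album_size=682, pack_size=5, seed=1):
    """Simulate filling n_albums albums and return the packs needed for each."""
    rng = random.Random(seed)
    return [packs_to_fill(album_size, pack_size, rng) for _ in range(n_albums)]


results = simulate()
average = sum(results) / len(results)
print(f"fewest packs: {min(results)}, average: {average:.0f}, most: {max(results)}")
```

Even in a small run like this, the largest value tends to sit much further above the average than the smallest value sits below it, which is the asymmetry described above.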

This illustrates an important point about Statistics in general – an average is exactly that: an average. And individual experiences might differ considerably from that average value. Moreover, asymmetry in the underlying probability distribution – as seen in the histogram above – will imply that variations from the average are likely to be bigger in one direction than the other. In the case of Panini sticker albums, you might end up paying a lot more than the average of £775, but are unlikely to spend very much less.

To be fair to Panini, it’s common for collectors to swap duplicate stickers with those of other collectors. This obviously has the effect of reducing the number of packs needed to complete the album. Furthermore, Panini now provide an option for collectors to order up to 50 specific stickers, enabling collectors who have nearly finished the album to do so without buying further packs when the chance of duplication is at its highest. So for both these reasons, the expected costs of completing the album as calculated above are over-estimates. On the other hand, if certain stickers are made deliberately rarer than others, the expected number of packs will increase! Would Panini do that? We’ll discuss that in a future post.

Meantime, for maths enthusiasts, and just in case you’re interested, let’s see where the formula

$(682/682 + 682/681 + 682/680 + \ldots + 682/1) /5 \approx 969$

comes from. You might remember from an earlier post that if I repeat an experiment that has probability p of success until I get my first success, I will have to repeat the experiment an average of 1/p times. Well, buying new stickers until I get one that’s different from those I’ve already collected is an experiment of exactly this type, so I can use this result. But as the number of stickers I’ve already collected changes, so does the probability of obtaining a different sticker.

• At the start, I have 0 stickers, so the probability the next sticker will be a new sticker is 682/682, and the expected number of stickers I’ll need till the next new sticker is 682/682. (No surprises there.)
• I will then have 1 sticker, and the probability the next sticker will be a new sticker is 681/682.  So the expected number of stickers I’ll need till the next new sticker is 682/681.
• I will then have 2 different stickers, and the probability the next sticker will be a new sticker is 680/682.  So the expected number of stickers I’ll need till the next new sticker is 682/680.
• This goes on and on till I have 681 stickers and the probability the next sticker will be a new sticker is 1/682.  So the expected number of stickers I’ll need till the next new sticker is 682/1.

At that point I’ll have a complete collection. Adding together all these expected numbers of stickers gives

$(682/682 + 682/681 + 682/680 + \ldots + 682/1)$

But each pack contains 5 stickers, so the expected number of packs I’ll need  is

$(682/682 + 682/681 + 682/680 + \ldots + 682/1) /5 \approx 969$
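For anyone wanting the general version, the same argument can be sketched with symbols. Here $n$ and $k$ are labels introduced for the number of spaces in the album and the number of stickers per pack, and duplicates within a pack are still allowed for simplicity:

$\frac{1}{k}\left(\frac{n}{n} + \frac{n}{n-1} + \ldots + \frac{n}{1}\right) = \frac{n}{k}\left(1 + \frac{1}{2} + \ldots + \frac{1}{n}\right) \approx \frac{n}{k}\left(\ln n + 0.5772\right)$

The approximation uses the standard result that the sum $1 + 1/2 + \ldots + 1/n$ is close to $\ln n + 0.5772$ when $n$ is large. With $n = 682$ and $k = 5$ this gives $136.4 \times 7.102 \approx 969$, in agreement with the value above.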