Here’s to all the money at the end of the world

I made the point in last week’s Valentine’s Day post that, although the emphasis of this blog is on the methodology of using Statistics to understand the world through the analysis of data, statistics often tell a story all by themselves. That’s how we learnt that a good proportion of the UK population buy their pets presents for Valentine’s Day.

As if that wasn’t bad enough, I now have to report to you the statistical evidence for the fact that nature itself is dying. Or as the Guardian puts it:

Plummeting insect numbers ‘threaten collapse of nature’

The statistical and scientific evidence now points to the fact that, at current rates of decline, all insects could be extinct by the end of the century. Admittedly, it’s probably not great science or statistics to extrapolate the current annual loss of 2.5% in that way, but it nevertheless gives you a picture of the way things are going. This projected elimination of insects would be, by some definitions, the sixth mass extinction event on Earth. (Earlier ones wiped out the dinosaurs, among much else.)

And before you go all Donald Trump, and say ‘bring it on: mosquito-free holidays’, you need to remember that life on earth is a complex ecological system in which the big things (including humans) are indirectly dependent on the little things (including insects) via complex bio-mechanisms for mutual survival. So if all the insects go, all the humans go too. And this is by the end of the century, remember.

Here’s First Dog on the Moon’s take on it:

So, yeah, let’s do our best to make money for our clients. But let’s also not forget that money only has value if we have a world to spend it in, and use Statistics and all other means at our disposal to fight for the survival of our planet and all the species that live on it.

Famous statisticians: Sir Francis Galton


This is the second in a so-far very short series on famous statisticians from history. You may remember that the first in the series was on John Tukey. As I said at that time, rather than just include statisticians randomly in this series, I’m going to focus on those who have had an impact beyond the realm of just statistics.

With that in mind, this post is about Sir Francis Galton (1822-1911), an English statistician who did most of his work in the second half of the 19th century, around the time that Statistics was being born as a viable scientific discipline.

You may have seen Galton’s name here recently: in the post on the bean machine, I mentioned that the device also goes under the name ‘Galton board’. This is because Galton invented the machine, which he used to illustrate the Central Limit Theorem, as discussed in that post. You may also remember an earlier post in which I discussed ‘regression to the mean’; Galton was the first person to explore and describe this phenomenon, as well as the more general concept of correlation, which describes the extent to which two random phenomena are connected.

It’s probably no coincidence that Galton was a half-cousin of Charles Darwin, since much of Galton’s pioneering work was on the way statistics could be used to understand genetic inheritance and human evolution. Indeed, he coined the term eugenics during his attempts to understand the extent to which intelligence is inherited rather than developed.

Galton is described in Wikipedia as:

  • A statistician
  • A progressive
  • A polymath
  • A sociologist
  • A psychologist
  • An anthropologist
  • A eugenicist
  • A tropical explorer
  • A geographer
  • An inventor
  • A meteorologist
  • A proto-geneticist
  • A psychometrician

And you thought you were busy. Anyway, it’s fair to say that Galton falls in my category of statisticians who have done something interesting with their lives outside of Statistics.

His various contributions apart from those mentioned above include:

  1. He introduced the weather map to the popular press;
  2. He wrote a book ‘The Art of Travel’ which offered practical travel advice to Victorians;
  3. He was the first to propose the use of questionnaires as a means of data collection;
  4. He conceived the notion of standard deviation as a way of summarising the variation in data;
  5. He devised a technique called composite portraiture – an early version of Photoshop – for making montages of photographic portraits;
  6. He pretty much invented fingerprinting as a reliable technique for identifying individuals.

In summary, many of the things Galton worked on or invented are still relevant today. And this is just as true for his non-statistical contributions, as for his statistical ones. Of course, it’s an unfortunate historical footnote that his theory of eugenics – social engineering to improve biological characteristics in populations – was adopted and pushed to extremes in Nazi Germany, with unthinkable consequences.

In retrospect, it’s a pity he didn’t just stop once he’d invented the bean machine.


Happy Valentine’s Day

Happy Valentine’s Day. In case you didn’t get any cards or gifts today, please know that Smartodds loves Statistics loves you.

Anyway, I thought it might be interesting to research some statistics about Valentine’s Day, and found this article, from which I learned much more about the population of Britain than I was expecting to.

Here are some of the highlights:

  1. A significant number of people spend money on Valentine’s Day presents for their pets. This proportion varies by generation, and is as high as 8.7% for millennials.
  2. A slightly smaller, but still significant, number of people spend money on themselves for Valentine’s. Again, this trend is most prevalent among millennials, and is also more common for women than men.
  3. 36.2% of people get unwanted gifts most years.
  4. 19% of people plan to celebrate Valentine’s late in order to save money by buying cards and gifts once the prices have dropped.

I’m not sure which of these statistics I found to be the most shocking.

Most of the posts in this blog are about the way Statistics as a science can be used to investigate problems and interpret data. But sometimes, the statistics are fascinating in themselves, and don’t require any kind of mathematical sophistication to reveal the secrets that they contain.

Anyway, I have to run now to buy myself my girlfriend a gift.

Happy Valentine’s…

Dance, dance, dance…

Ever thought: ‘I’m pretty sure I would fully understand Statistics, if only a modern dance company would illustrate the techniques for me’?

I hope you get the idea of what I’m trying to do with this blog by now. Fundamentally, Statistics is a very intuitive subject, but that intuition is often masked by technicalities, so that from the outside the subject can seem both boring and impenetrable. The aim of all of my posts is to try to show that neither of those things is true: Statistics is both fascinating and easily understandable. And in this way, whatever your connection to Smartodds, you’ll be better equipped to understand the statistical side of the company’s operations.

Of course, I’m not the only person to try to de-mystify Statistics, and there are many books, blogs and other learning aids with similar aims.

With this in mind, I recently came across a rather unusual set of resources for learning Statistics: a series of dance videos whose aim is to explain statistical concepts through movement. Probably my ‘favourite’ is this one, which deals with the notions of sampling and standard error. You might like to take a look…

I think it’s fair to say that the comments on these videos on YouTube are mixed. One person wrote:

This way it makes complicated things look simpler. Very informative and useful. Loved it. 🙂

While another said:

this makes simple things look complicated but thanks anyway

So, I guess it depends on your perspective. I think I’m on the side of the latter commenter though: I’m pretty sure that in 5 minutes I could give a much clearer and more entertaining explanation of the issues this film is trying to address than the film does itself. But maybe that’s not the point. Perhaps the point is that different things hook different people in, and while personally I can’t think of a much more complicated way of thinking about issues of sampling and measuring accuracy, the dance perspective seems to work for some people.

Anyway, if you think this might be the key to help you unlock some of the mysteries of Statistics, you can find the full series of four videos here, covering topics like correlation and standard deviation. Enjoy.


Groundhog day

Fed up of the cold, snow and rain? Don’t worry, spring is forecast to be here earlier than usual. Two caveats though:

  1. ‘Here’ is some unspecified region of the United States, and might not extend as far as the UK;
  2. This prediction was made by a rodent.

Yes, Saturday (February 2nd) was Groundhog Day in the US. And since Punxsutawney Phil failed to see his shadow, spring is forecast to arrive early.

You probably know about Groundhog Day from the Bill Murray movie

… but it’s actually a real event. It’s celebrated in many locations across the US and Canada, though it’s the event in Punxsutawney, Pennsylvania, that has become the most famous, and on which the movie was based. As Wikipedia says:

The Groundhog Day ceremony held at Punxsutawney in western Pennsylvania, centering around a semi-mythical groundhog named Punxsutawney Phil, has become the most attended.

Semi-mythical, no less. If you’d like to know more about Punxsutawney Phil, there’s plenty of information at The Punxsutawney Groundhog Club website, including a dataset of his predictions. These include the entry from 1937 when Phil had an ‘unfortunate meeting with a skunk’. (And whoever said data analysis was boring?)

Anyway, the theory is that if, at 7.30 a.m. on the second of February, Phil the groundhog sees his shadow, there will be six more weeks of winter; if not, spring will arrive early. Now, it seems a little unlikely that a groundhog will have powers of meteorological prediction, but since the legend has persisted, and there is other evidence of animal behaviour serving as a weather predictor, it seems reasonable to assess the evidence.

Disappointingly, Phil’s success rate is rather low. This article gives it as 39%. I’m not sure whether it’s obvious or not, but the article also states (correctly) that if you were to guess randomly, by tossing a coin, say, your expected chance of guessing correctly is 50%. The reason I say it might not be obvious is that the chance of spring arriving early is unlikely to be 50%. It might be 40%, say. Yet randomly guessing with a coin still has a 50% expected success rate: if early springs occur a proportion p of the time, the coin is right with probability 0.5 × p + 0.5 × (1 − p) = 0.5, whatever the value of p. As such, Phil is doing worse than someone who guesses at random, or via coin tossing.
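If the algebra doesn’t convince you, a quick simulation might. Here’s a minimal sketch in Python, taking the 40% figure above as an assumed base rate for early springs:

```python
import random

random.seed(1)
trials = 100_000
p_early = 0.4  # assumed base rate of early springs (any value works)

correct = 0
for _ in range(trials):
    early_spring = random.random() < p_early  # what actually happens
    coin_says_early = random.random() < 0.5   # the coin's prediction
    correct += (coin_says_early == early_spring)

print(correct / trials)  # ≈ 0.5, whatever value p_early takes
```

Try changing p_early: the coin’s success rate stays pinned at 50%, which is exactly the benchmark Phil fails to reach.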

However, if Phil’s 39% success rate is a genuine measure of his predictive powers – rather than a reflection of the fact that his guesses are also random, and he’s just been a bit unlucky over the years – then he’s still a very useful companion for predictive purposes. You just need to take his predictions, and predict the opposite. That way you’ll have a 61% success rate – rather better than random guessing. Unfortunately, this means you will have to put up with another 6 weeks of winter.

Meantime, if you simply want more Groundhog Day statistics, you can fill your boots here.

And finally, if you think I’m wasting my time on this stuff, check out the Washington Post, which has done a geo-spatial analysis of the whole of the United States to colour-map the regions in which Phil has been more and less successful with his predictions over the years.



The benefit of foresight

Ok, I’m going to be honest… I’m not really happy with this post. I keep deleting it and re-writing it, but can’t get it in a form where it eloquently says what I want it to say. (Insert your own <like all of your other posts> joke here).

I’m trying to say the following things:

  1. Trading in sports – or any field – is about predicting what will happen in the future;
  2. Data are a summary of the past. If the future behaves like the past, then the data are likely to be useful; if it doesn’t, they’re likely to be less useful;
  3. There is often information about the way things are likely to change in the future that’s external to, and not included in, data;
  4. This means that predictions for sports trading based on statistical procedures will always be improved by the inclusion of additional knowledge and information that is provided by experts.

That’s what the rest of this post is trying to say. Unfortunately, it’s an admission of a poor post that I’m having to tell you this in advance, rather than letting you draw these conclusions yourself.

Anyway…


It’s often said that ‘with the benefit of hindsight, things could have been done better’. But since hindsight isn’t available when trading on sports, the best we can do is make optimal use of foresight.

This season has been a record-breaker for the NFL. Among other tumbling records, at 1371, the number of touchdowns in the regular season is the largest in the league’s 99-year history.

Of course, random variation means records will be broken from time to time just by chance, but if this sudden increase in points was actually predictable, then bets placed on NFL would have been improved if they had taken this into account.

Naturally, as statisticians, our primary source of evidence is contained in data, and we aim to exploit basic patterns and trends in data to help make predictions for the future. But data are by definition a snapshot of the past, and the models we develop will only work well if the future behaves like the past. Admittedly, if changes have already occurred, these will be encapsulated in data, and can be extrapolated into predictive models for the future. But data do not, in themselves, describe mechanisms of change. And it will always be essential to use additional sources of information and knowledge, not contained in data, to temper, inform and modify predictions from data-based statistical models.

With all that in mind, I found this article an interesting read. It provides a chronology of events connected to the NFL, all of which have contributed one way or another to the current attack-oriented style of play. The foresight to use this knowledge at the start of the season, modifying predictions to account for a likely increase in points due to the greater emphasis on attack, would almost certainly have led to better predictions than those provided by data-based models alone.


Stickers


Last year’s FIFA World Cup Panini sticker album had spaces for 682 stickers. Stickers were sold in packs of 5, at a cost of 80 pence per pack. How much was it likely to cost to fill the whole album? Maybe have a guess at this before moving on.

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|


Well, to get 682 stickers you need 137 packs, so the obvious (but wrong) answer is 137 times 80 pence, which is £109.60. It’s wrong, of course, because it doesn’t take into account duplicate stickers: as the album fills up, when you buy a new pack, it’s likely that at least some of the new stickers will be stickers that you’ve already collected. And the more stickers you’ve already collected, the more likely it is that a new pack will contain stickers that you’ve already got. So, you’re likely to need many more than 137 packs and spend much more than £109.60. But how much more?

It turns out (see below) that on average the number of packs needed can be calculated as

$$\left(\frac{682}{682} + \frac{682}{681} + \frac{682}{680} + \cdots + \frac{682}{1}\right) \Big/ \, 5 \;\approx\; 969$$

where the “⋯” means “plus all the terms in between”: the next term you have to add is 682/679, then 682/678, and so on, all the way down to the final term, which is 682/1.

So the average cost of filling the album is around 969 × 80 pence, or £775. You can probably also guess how this calculation changes if the number of spaces in the album were different from 682, or if the number of stickers per pack were different from 5.
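If you’d like to check that arithmetic for yourself, the sum is easy to evaluate directly. A minimal sketch in Python:

```python
n_stickers = 682   # spaces in the album
pack_size = 5
pack_price = 0.80  # pounds

# Expected number of stickers drawn to complete the album:
# the sum of 682/k for k = 682 down to 1 (see the derivation below).
expected_stickers = sum(n_stickers / k for k in range(1, n_stickers + 1))
expected_packs = expected_stickers / pack_size

print(round(expected_packs))               # ≈ 969
print(round(expected_packs * pack_price))  # ≈ 775 (pounds)
```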

Well, actually, there’s a small mistake in this calculation. Strictly speaking, when you buy packs of 5 stickers, none of the stickers in a pack will be duplicates among themselves. The above calculation ignores this fact, and assumes that duplicates could occur within packs. However, it turns out that doing the mathematics more carefully – which is quite a bit more complicated – leads to a not-very-different answer of £773. So, we might have simplified things in our calculation of £775, but we didn’t lose much in terms of accuracy.

Anyway, a question that’s just as interesting as the accuracy of the answer is what the value of £775 means in practice. Though it’s the average value that would be spent by many collectors in filling the album, the actual experience of any individual collector might be quite different from this. The mathematics is more complicated again in this case, but we can avoid the complexity by simulating the process. The figure below shows a histogram of the number of packs needed to fill the album in a simulation of 10,000 albums.
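A minimal sketch of such a simulation is below. Like the calculation above, it draws stickers one at a time, so it too ignores the no-duplicates-within-a-pack detail:

```python
import math
import random

def packs_to_fill(n_stickers=682, pack_size=5):
    """Simulate one collector: draw random stickers until the album is full."""
    owned = set()
    drawn = 0
    while len(owned) < n_stickers:
        owned.add(random.randrange(n_stickers))  # each sticker equally likely
        drawn += 1
    return math.ceil(drawn / pack_size)

random.seed(1)
results = [packs_to_fill() for _ in range(10_000)]  # pure Python, so be patient
print(sum(results) / len(results), min(results), max(results))
# mean close to the theoretical 969; min comfortably above 500; max can exceed 2000
```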

[Figure: histogram of the number of packs needed to fill the album, over 10,000 simulated albums]

So, for example, in around 1500 of the simulated albums roughly 800 packs were needed to complete the collection. The average number of packs needed turns out, of course, to be close to the theoretical average of 969. But although sometimes fewer packs than this were needed, the asymmetry of the histogram means that on many occasions far more than the average number was needed. For example, on a significant number of occasions more than 1000 packs were needed; on several occasions more than 1500; and on a few occasions more than 2000 (at a cost of over £1600!). By contrast, on no occasion were 500 packs sufficient to complete the album. So, even though an average spend of £775 probably already sounded like a lot of money, any individual collector might need to spend £2000 or more, while every collector would need to spend at least £400.

This illustrates an important point about Statistics in general – an average is exactly that: an average. And individual experiences might differ considerably from that average value. Moreover, asymmetry in the underlying probability distribution – as seen in the histogram above – will imply that variations from the average are likely to be bigger in one direction than the other. In the case of Panini sticker albums, you might end up paying a lot more than the average of £775, but are unlikely to spend very much less.


To be fair to Panini, it’s common for collectors to swap duplicate stickers with those of other collectors. This obviously has the effect of reducing the number of packs needed to complete the album. Furthermore, Panini now provide an option for collectors to order up to 50 specific stickers, enabling collectors who have nearly finished the album to do so without buying further packs when the chance of duplication is at its highest. So for both these reasons, the expected costs of completing the album as calculated above are over-estimates. On the other hand, if certain stickers are made deliberately rarer than others, the expected number of packs will increase! Would Panini do that? We’ll discuss that in a future post.
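To get a feel for the effect of rare stickers, here’s a variation on the simulation above with a purely hypothetical set-up – the numbers are invented for illustration – in which 10 of the 682 stickers are printed at a fifth of the usual rate:

```python
import random
from itertools import accumulate

n_stickers = 682
weights = [1.0] * n_stickers
for i in range(10):    # hypothetical: 10 stickers, five times rarer than the rest
    weights[i] = 0.2
cum = list(accumulate(weights))  # cumulative weights, for fast sampling

def packs_to_fill_weighted(pack_size=5):
    owned, drawn = set(), 0
    while len(owned) < n_stickers:
        owned.add(random.choices(range(n_stickers), cum_weights=cum, k=1)[0])
        drawn += 1
    return -(-drawn // pack_size)  # ceiling division

random.seed(1)
results = [packs_to_fill_weighted() for _ in range(200)]  # a small run: it's slow
print(sum(results) / len(results))  # well above the 969 packs of the uniform case
```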


Meantime, for maths enthusiasts, let’s see where the formula

$$\left(\frac{682}{682} + \frac{682}{681} + \frac{682}{680} + \cdots + \frac{682}{1}\right) \Big/ \, 5 \;\approx\; 969$$

comes from. You might remember from an earlier post that if I repeat an experiment that has probability p of success until I get my first success, I will have to repeat the experiment an average of 1/p times. Well, buying new stickers until I get one that’s different from those I’ve already collected is an experiment of exactly this type, so I can use this result. But as the number of stickers I’ve already collected changes, so does the probability of obtaining a different sticker.

  • At the start, I have 0 stickers, so the probability the next sticker will be a new sticker is 682/682, and the expected number of stickers I’ll need till the next new sticker is 682/682. (No surprises there.)
  • I will then have 1 sticker, and the probability the next sticker will be a new sticker is 681/682. So the expected number of stickers I’ll need till the next new sticker is 682/681.
  • I will then have 2 different stickers, and the probability the next sticker will be a new sticker is 680/682. So the expected number of stickers I’ll need till the next new sticker is 682/680.
  • This goes on and on till I have 681 stickers and the probability the next sticker will be a new sticker is 1/682. So the expected number of stickers I’ll need till the next new sticker is 682/1.

At that point I’ll have a complete collection. Adding together all these expected numbers of stickers gives

$$\frac{682}{682} + \frac{682}{681} + \frac{682}{680} + \cdots + \frac{682}{1}$$

But each pack contains 5 stickers, so the expected number of packs I’ll need is

$$\left(\frac{682}{682} + \frac{682}{681} + \frac{682}{680} + \cdots + \frac{682}{1}\right) \Big/ \, 5 \;\approx\; 969$$

Nokia 3310

Whatever happened to the Nokia 3310, and what’s that got to do with sports data?

Many of you will know Rasmus Ankersen from his involvement with both Brentford and Midtjylland. Maybe you’ve also seen this video of a TED talk Rasmus gave a while back, though I’ve only just come across it. I think it’s interesting because there are now plenty of articles, books and – ahem – blogs which emphasise the potential of statistics and data analytics in both sports and gambling. But Rasmus’s talk goes in the other direction, arguing that since data analytics has proved a valuable tool for gambling on sports, there are lessons to be learned for leaders of business and industry. The main themes are:

  1. In any process where there’s an element of chance, it’s important to recognise that good and bad results are not just a function of good and bad performance, but also of good and bad luck;
  2. There are potentially huge gains in trying to identify the aspects of performance that determine either good or bad results, notwithstanding the interference effects of luck.

In other words, businesses, like football teams, have results that are part performance-driven and part luck. Signal and noise, if you like. Rasmus argues that good business, like good football management, is about identifying what it is that determines the signal, while allowing for the noise. And only by adopting this strategy can companies, like Nokia, avoid the type of sudden death that happened to the 3310. Or as Rasmus puts it: “RIP at gadgets graveyard”.

Anyway, Rasmus’s talk is a great watch, partly because of the message it sends about the importance of Statistics to both sport and industry, but also because it includes something about the history of the relationship between Smartodds, Brentford and Midtjylland. Enjoy.

The gene genie

One of the most remarkable advances in scientific understanding over the last couple of hundred years has been Mendelian genetics. This theory explains the basics of genetic inheritance, and is named after its discoverer, Gregor Mendel, who developed the model based on observations of the characteristics of peas when cross-bred from different varieties. In his most celebrated experiment, he crossed pure yellow with pure green peas, and obtained a generation consisting of only yellow peas. But in the subsequent generation, when these peas were crossed, he obtained a mixed generation of yellow and green peas. Mendel constructed the theory of genes and alleles to explain this phenomenon, which subsequently became the basis of modern genetic science.

You probably know all this anyway, but if you’re interested and need a quick reminder, here’s a short video giving an outline of the theory.

Mendel’s pea experiment was very simple, but from the model he developed he was able to calculate the proportion of peas of different varieties to be expected in subsequent generations. For example, in the situation described above, the theory suggests that there would be no green peas in the first generation, but around 1/4 of the peas in the second generation would be expected to be green.
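If you’d like to see where the 1/4 comes from, here’s a minimal simulation of the second-generation cross, assuming the usual set-up in which the yellow allele (Y) is dominant over the green one (y):

```python
import random

random.seed(1)

# Pure yellow (YY) crossed with pure green (yy) gives first-generation
# offspring that are all Yy -- and all yellow, since Y is dominant.
f1_alleles = ("Y", "y")

# Second generation: cross two Yy plants; each parent passes on one allele
# at random, and the offspring is green only if it gets y from both.
n = 100_000
green = sum(
    random.choice(f1_alleles) == "y" and random.choice(f1_alleles) == "y"
    for _ in range(n)
)
print(green / n)  # ≈ 0.25
```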

Mendel’s theory extends to more complex situations; in particular, it allows for the inheritance of multiple characteristics. In the video, for example, the characteristic for peas to be yellow/green is supplemented by their propensity to be round/wrinkled. Mendel’s model leads to predictions of the proportion of peas in each generation when stratified by both characteristics: round and green, yellow and wrinkled, and so on.

The interesting thing from a statistical point of view is the way Mendel verified his theory. All scientific theories go through the same validation process: first there are some observations; second those observations lead to a theory; and third there is a detailed scrutiny of further observations to ensure that they are consistent with the theory. If they are, then the theory stands, at least until there are subsequent observations which violate the theory, or a better theory is developed to replace the original.

Now, where there is randomness in the observations, the procedure of ensuring that the observations are in agreement with the theory is more complicated. For example, consider the second generation of peas in the experiment above. The theory suggests that, on average, 1/4 of the peas should be green. So if we take 100 peas from the second generation, we’d expect around 25 of them to be green. But that’s different from saying exactly 25 should be green. Is it consistent with the theory if we get 30 green peas? Or 40? At what point do we decide that the experimental evidence is inconsistent with the theory? This is the substance of Statistics.

Actually, the theory of Mendelian inheritance can be expressed entirely in terms of statistical models. There is a specific probability that certain characteristics are passed on from parents to offspring, and this leads to expected proportions of different types in subsequent generations. And expressed this way, we don’t just learn that 1/4 of second generation peas should be green, but also the probability that in a sample of 100 we get 30, 40 or any number of green peas.
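For instance, under the model the number of green peas in a sample of 100 is binomial with n = 100 and p = 1/4, so the probability of any particular count can be computed directly. A quick sketch:

```python
from math import comb

n, p = 100, 0.25

def prob_green(k):
    """P(exactly k green peas in a sample of n), under Mendel's model."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

for k in (25, 30, 40):
    print(k, round(prob_green(k), 4))
# 25 is the single most likely count, 30 is still entirely plausible,
# but 40 would be a genuinely surprising result.
```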

And this leads to something extremely interesting: Mendel’s experimental results are simply too good to be true. For example – though I’m actually making the numbers up here – in repeats of the simple pea experiment he almost always got something very close to 25 green peas out of 100. As explained above, the statistics behind Mendelian inheritance do indeed say that he should have got an average of 25 per population. But the same theory also implies that 20 or 35 green peas out of 100 are entirely plausible, and indeed a spread of experimental results between 20 and 35 is to be expected. But, each of Mendel’s experiments gave a number very close to 25. Ironically, if these really were the experimental results, they would be in violation of the theory, which expects not just an average of 25, but with an appropriate amount of variation around that figure.
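To make ‘an appropriate amount of variation’ concrete: under the binomial model above, the standard deviation of the count of green peas in a sample of 100 is

$$\sqrt{np(1-p)} = \sqrt{100 \times \tfrac{1}{4} \times \tfrac{3}{4}} \approx 4.3,$$

so roughly 95% of honest repetitions of the experiment should fall within about two standard deviations of the mean, anywhere from about 16 to 34 green peas. Counts that sit glued to 25 every time are less variable than the theory itself demands.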

So, Mendel’s experimental results were actually a primitive example of fake news. But here’s the thing: Mendel’s theory has subsequently been shown to be correct, even if it seems likely that the evidence he presented had been manipulated to strengthen its case. In modern parlance, Mendel focused on making sure his results supported the predicted average, but failed to appreciate that the theory also implied something about the variation in observations. So even if the experimental results were fake news, the theory itself has been shown to be anything but fake.

To be honest, there is some academic debate about whether Mendel cheated or not. As far as I can tell though, this is largely based on the assumption that since he was also a monk and a highly-regarded scientist, cheating would have been out of character. Nobody really denies the fact that the statistics really are simply too good to be true. Of course, in the end, it really is all academic, as the theory has been proven to be correct and is the basis for modern genetic theory. If interested, you can follow the story a little further here.


Incidentally, the fact that statistical models speak about variation as well as about averages is essential to the way they get used in sports modelling. In football, for example, models are generally estimated on the basis of the average number of goals a team is expected to score. But the prediction of match scores as a potential betting aid requires information about the variation in the number of goals around the average value. And though Mendel seems not to have appreciated the point, a statistical model contains information on both averages and variation, and if a model is to be suitable for data, the data will need to be consistent with the model in terms of both aspects.