# Happy Pi day

Oops, I almost missed the chance to wish you a happy Pi day. So, almost belatedly:

Happy Pi day!!!

You probably know that Pi – or more accurately, 𝜋 – is one of the most important numbers in mathematics, occurring in many surprising places.

Most people first come across 𝜋 at school, where you learn that it’s the ratio between the circumference and the diameter of any circle. But 𝜋 also crops up in almost every other area of mathematics as well, including Statistics.  In a future post I’ll give an example of this.

Meantime, why is today Pi day? Well, today is March 14th, or 3/14 if you’re American. And the approximate value of Pi is 3.14. more accurately, here’s the value of 𝜋 to 100 digits:

3.14159 26535 89793 23846 26433 83279 50288 41971 69399 37510 58209 74944 59230 78164 06286 20899 86280 34825 34211 7067

Not enough for you? You can get the first 100,000 digits here.

But that’s just a part of the story. You probably also know that Pi is what’s known as an irrational number, which means that its decimal representation is infinite and non-repeating. And today it was announced that Pi has just been computed to an accuracy of 31.4 trillion decimal digits, beating the previous most accurate computation by nearly 9 trillion digits.

That’s impressive computing power, obviously, but how about simply remembering the digits of 𝜋? Chances are you remembered from school the first three digits of 𝜋: 3, 1, 4. But the current world record for remembering the value of 𝜋 is 70,030 digits, which is held by Suresh Kumar Sharma of India? And almost as impressively, here’s an 11-year old kid who managed 2091 digits.

Like I say, I’ll write about 𝜋’s importance in Statistics in another post.

Meantime, here’s Homer:

# Mr. Greedy

In parallel to my series of posts on famous statisticians, I also seem to be running a series of posts on characters from the Mr. Men books. Previously we had a post about Mr. Wrong. And now I have to tell you about Mr. Greedy. In doing so, you’ll hopefully learn something about the limitation of Statistics.

It was widely reported in the media last weekend (here, here and here for example) that a recent study had shown that the Mr. Greedy book is as complex a work of literature as various American classics including ‘Of Mice and Men’ and ‘The Grapes of Wrath’, each by John Steinbeck, the latter having won the Pulitzer prize for literature.

To cut a long story short, the authors of the report have developed a method of rating the readability of a book, based essentially on the complexity and phrasing of the words that it uses. They’ve done this by measuring these features for a large number of books, asking people to read the books, measuring how much they understood, and then creating a map from one to the other using standard regression techniques from Statistics. A detailed, though – irony alert! – not very easily readable, description of the analysis is given here.

The end result of this process is a formula which takes the text of a book and converts it into a ‘readability’ score. Mr. Greedy got a score of 4.4, ‘Of Mice and Men’ got 4.5 and ‘The Grapes of Wrath’ got 4.9. The most difficult book in the database was ‘Gulliver’s Travels’, with a score of 13.5. You can check the readability index value – labelled BL for ‘Book Level’ – for any book in the database by using this dialog search box.

So, yes, Mr. Greedy is almost as complex a piece of literature as the Steinbeck classics.

But… there’s a catch, of course. Any statistical analysis is limited to its own terms of reference, which in this case means that comprehension is measured in a strictly literal sense; not comprehension in a narrative sense. In other words, no attempt was made to assess whether readers understood the sum total in a literary sense of what they were reading, just the individual words and sentences. As such, the values 4.4, 4.5 or anything else say nothing about how difficult a book is to read in terms of narrative comprehension. Sure, the words and sentence structure of Mr. Greedy and The Grapes of Wrath are of similar complexity, but having understood the words in both, understanding the full meaning of Mr. Greedy is likely to be an easier task.

Does this have any relevance at all to sports modelling? Admittedly, not much. Except, it’s always important to understand what has, and has not, been included in a sports model. For example, in a football model based only on goals, when using predictions, it’s relevant to consider making adjustments if you are aware that a team has been especially unlucky in previous games (hit the post; marginal offside; etc etc). But if the model itself already included data of this type in its formulation, then it’s likely to be incorrect to make further adjustments, as doing so would be to double-count the effects.

In summary, if you are using a statistical model or analysis, make sure you know what it includes so as to avoid double-counting in sports models, or buying your 2 year-old nephew a Pulitzer prize winning American masterpiece for their birthday.

# Ernie is dead, long live Ernie

Oh no, this weekend they killed Ernie

Well, actually, not that one. This one…

No, no, no. That one died some time ago. This one…

But don’t worry, here’s Ernie (mark 5)…

Let me explain…

Ernie (Electronic Random Number Indicator Equipment) is the acronym of the random number generator that is used by the government’s National Savings and Investments (NSI) department for selecting Premium Bond winners each month.

Premium bonds are a form of savings certificates. But instead of receiving a fixed or variable interest rate paid at regular intervals, like most savings accounts, premium bonds are a gamble. Each month a number of bonds from all those in circulation are selected at random and awarded prizes, with values ranging from ￡25 to ￡1,000,000. Overall, the annual interest rate is currently around 1.4%, but with this method most bond holders will receive 0%, while a few will win many times more than the actual bond value of ￡1, up to one million pounds.

So, your initial outlay is safe when you buy a premium bond – you can always cash them in at the price you paid for them – but you are gambling with the interest.

Now, the interesting thing from a statistical point of view is the monthly selection of the winning bonds. Each month there are nearly 3 million winning bonds, most of which win the minimum prize of ￡25, but 2 of which win the maximum of a million pounds. All these winning bonds have to be selected at random. But how?

As you probably know, the National Lottery is based on a single set of numbers that are randomly generated through the physical mechanism of the mixing and selection of numbered balls. But this method of random number generation is completely impractical for the random selection of several million winning bonds each month. So, a method of statistical simulation is required.

In a previous post we already discussed the idea of simulation in a statistical context. In fact, it turns out to be fairly straightforward to generate mathematically a series of numbers that, to all intents and purposes, look random. I’ll discuss this technique in a future post, but the basic idea is that there are certain formulae which, when used recursively, generate a sequence of numbers that are essentially indistinguishable from a series of random numbers.

But here’s the thing: the numbers are not really random at all. If you know the formula and the current value in the sequence, you can calculate exactly the next value in the sequence. And the next one. And so on.

Strictly, a sequence of numbers generated this way is called ‘pseudo-random’, which is a fancy way of saying ‘pretend-random’. They look random, but they’re not. For most statistical purposes, the difference between a sequence that looks random and is genuinely random is unimportant, so this method is used as the basis for simulation procedures. But for the random selection of Premium Bond winners, there are obvious logistic and moral problems in using a sequence of numbers that is actually predictable, even if it looks entirely random.

For this reason, Ernie was invented. Ernie is a random number generator. But to ensure the numbers are genuinely random, it incorporates a genuine physical process whose behaviour is entirely random. A mathematical representation of the state of this physical process then leads to the random numbers.

The very first Ernie is shown in the second picture above. It was first used in 1957, was the size of a van and used a gas neon diode to induce the randomness. Though effective, this version of Ernie was fairly slow, generating just 2000 numbers per hour. It was subsequently killed-off and replaced with ever-more efficient designs over the years.

The third picture above shows Ernie (mark 4), which has been in operation from 2004 up until this weekend. In place of gas diodes, it used thermal noise in transistors to generate the required randomness, which in turn generated the numbers. Clearly, in terms of size, this version was a big improvement on Ernie (mark 1), being about the size of a normal PC. It was also much more efficient, being able to generate one million numbers in an hour.

But Ernie (mark 4) is no more. The final picture above shows Ernie (mark 5), which came into operation this weekend, shown against the tip of a pencil. It’s essentially a microchip. And of course, the evolution of computing equipment the size of a van to the size of a pencil head over the last 60 years or so is a familiar story. Indeed Ernie (mark 5) is considerably faster – by a factor of 42.5 or so – even compared to Ernie (mark 4), despite the size reduction. But what really makes the new version of Ernie stand out is that the physical process that induces the randomness has fundamentally changed. One way or another, all the previous versions used thermal noise to generate the randomness; Ernie (mark 5) uses quantum random variation in light signals.

More information on the evolution of Ernie can be found here. A slightly more technical account of the way thermal noise was used to generate randomness in each of the Ernie’s up to mark 4 is given here. The basis of the quantum technology for Ernie mark 5 is that when a photon is emitted towards a semi-transparent surface, is either reflected or transmitted at random. Converting these outcomes into 0/1 bit values, forms the building block of random number generation.

Incidentally, although the randomness in the physical processes built into Ernie should guarantee that the numbers generated are random, checks on the output are carried out by the Government Actuary’s Department to ensure that the output can genuinely be regarded as random. In fact they apply four tests to the sequence:

1. Frequency: do all digits occur (approximately) equally often?
2. Serial: do all consecutive number pairs occur (approximately) equally often?
3. Poker: do poker combinations (4 identical digits; 3 identical digits; two pairs; one pair; all different) occur as often as they should in consecutive numbers?
4. Correlation: do pairs of digits at different spacings in bond numbers have approximately the correct correlation that would be expected under randomness?

In the 60 or so years that premium Bonds have been in circulation, the monthly numbers generated by each of the successive Ernie’s have never failed to pass these tests.

However:

Finally, in case you’re disappointed that I started this post with a gratuitous reference to Sesame Street which I didn’t follow-up on, here’s a link to 10 facts and statistics about Sesame Street.

It’s sometimes said that a little knowledge is a dangerous thing. Arguably, too much knowledge is equally bad. Indeed, Einstein is quoted as saying:

A little knowledge is a dangerous thing. So is a lot.

I don’t suppose Einstein had gambling in mind, but still…

March Madness pools are a popular form of betting in the United States. They are based on the playoff tournament for NCAA college basketball, held annually every March, and comprise a so-called bracket bet. Prior to the tournament start, a player predicts the winners of each game from the round-of-sixteen right through to the final. This is possible since teams are seeded, as in tennis, so match pairings for future rounds are determined automatically once the winners from previous rounds are known. In practice, it’s equivalent to picking winners from the round-of-sixteen onwards in the World Cup.

There are different scoring systems for judging success in bracket picks, often with more weight given to correct outcomes in the later rounds, but in essence the more correct outcomes a gambler predicts, the better their score. And the player with the best score within a pool of players wins the prize.

Naturally, you’d expect players with some knowledge of the differing strength of the teams involved in the March Madness playoffs to do better than those with no knowledge at all. But is it the case that the more knowledge a player has, the more successful they’re likely to be? In other words:

To what extent is success in the March Madness pools determined by a player’s basketball knowledge?

This question was explored in a recent academic study discussed here. In summary, participants were given a 25-question basketball quiz, the results of which were used to determine their level of basketball knowledge. Next, they were asked to make their bracket picks for the March Madness. A comparison was then made between accuracy of bracket picks and level of basketball knowledge.

The results are summarised in the following graph, which shows the average relationship between pick accuracy and basketball knowledge:

As you’d expect, the players with low knowledge do relatively badly.  Then, as a player’s basketball knowledge increases, so does their pick accuracy. But only up to a point. After a certain point, as a player’s knowledge increases, their pick accuracy was found to decrease. Indeed, the players with the most basketball knowledge were found to perform slightly worse than those with the least knowledge!

Why should this be?

The most likely explanation is as follows…

Consider an average team, who have recently had a few great results. It’s possible that these great results are due to skill, but it’s also plausible that the team has just been a bit lucky. The player with expert knowledge is likely to know about these recent results, and make their picks accordingly. The player with medium knowledge  will simply know that this is an average team, and also bet accordingly. While the player with very little knowledge is likely to treat the team randomly.

Random betting due to lack of knowledge is obviously not a great strategy. However, making picks that are driven primarily by recent results can be even worse, and the evidence suggests that’s exactly what most highly  knowledgable players do. And it turns out to be better to have just a medium knowledge of the game, so that you’d have a rough idea of the relative rankings of the different teams, without being overly influenced by recent results.

Now, obviously, someone with expert knowledge of the game, but who also knows how to exploit that knowledge for making predictions, is likely to do best of all. And that, of course, is the way sports betting companies operate, combining expert sports knowledge with statistical support to exploit and implement that knowledge. But the study here shows that, in the absence of that explicit statistical support, the player with a medium level of knowledge is likely to do better than players with too little or too much knowledge.

In some ways this post complements the earlier post ‘The benefit of foresight’. The theme of that post was that successful gambling cannot rely solely on Statistics, but also needs the input of expert sports knowledge. This one says that expert knowledge, in isolation, is also insufficient, and needs to be used in tandem with statistical expertise for a successful trading strategy.

In the specific context of betting on the NCAA March Madness bracket, the argument is developed fully in this book. The argument, though, is valid much more widely across all sports and betting regimes, and emphasises the importance to a sports betting company of both statistical and sport expertise.

Update (21/3): The NCAA tournament actually starts today. In case you’re interested, here’s Barack Obama’s bracket pick. Maybe see if you can do better than the ex-President of the United States…

# The origin of all chavs upon this earth

This is a true story which includes an illustration of how interesting statistical questions can arise in simple everyday life. It’s a bit long though, so I’ll break it down into two posts. In this one, I’ll give you the background information. In a subsequent post, I’ll discuss a possible solution to the problem that arises.

As many of you know, I live in Italy. Actually, in a small town called Belluno in the north-east of Italy, on the edge of the dolomites. It’s great, but like most people’s life journey, my route here hasn’t been straightforward.

I grew up on a dismal overflow council estate called Leigh Park, on the distant outskirts of Portsmouth. Leigh Park was once the largest council estate in Europe and, according to this article, “could well be the origin of all chavs upon this earth”. (Just in case you’re unfamiliar with the term chav, Wikipedia gives this definition: “A pejorative epithet used in the United Kingdom to describe a particular stereotype of anti-social youth dressed in sportswear”. Explains a lot, right?)

Anyway, the other day I had to take my son to the dentist in Belluno for a check-up. The waiting area in the dentist’s has recently been refurbished, and they’ve installed a large-screen TV on the main wall. But instead of showing anything interesting, the TV just seems to flip through random images: pictures of animals; of paintings; of architecture; of cities; of people; of pretty much anything. It’s probably meant to be soothing or distracting while you’re waiting for your teeth to be drilled.

So, I sat down and started looking at this TV. And the first image I saw was of a single-decker bus with destination Leigh Park (!), a little bit like this…

My first thought, obviously, was that this was a remarkably unlikely coincidence: a TV screen in Belluno, Italy, showing the image of a bus heading towards the completely unremarkable housing estate I grew up on in England. But we’ve discussed this kind of issue before: our lives are filled with many random events each day, and we only notice the ones that are coincidences. So though it seems improbable that something like this could occur, it’s much less improbable when you balance it against the many, many unremarkable things in a day which also occur.

But the main theme of the story is something different…

I wanted to point out this coincidence – which connects to part of his own family history – to my son, but by the time I managed to drag his attention away from playing on his phone, the image had changed to something else. Having nothing better to do – my potential company for this visit was just playing on his phone, remember – I kept watching the TV screen. Now, although the images kept changing, I noticed after a while that some of the images had repeated. Not in any systematic order, but apparently at random. So, my best guess is that the screen was showing images from a fixed library of pictures in a random order. As such, the image of the Leigh Park bus would show up again at some point, but the time it would take to show up would depend on the size of the library of images. If there were just a few pictures, it would probably show up again very soon; if there were very many pictures, it would most likely take a long time.

So, here’s the question:

How could I estimate the number of images in the library being shown randomly on the TV?

This seems like a reasonable sort of question to ask. I have some data – a list of images I’ve observed, together with counts of the numbers of repeats I’ve seen. And the more often I see repeats, the smaller I might expect the number of available images to be. But what strategy could I use, either based on the images I’d already observed, or by also observing extra images, to estimate the number of images in the entire library?

I have an idea of how to tackle this problem, and I’ll discuss it in a future post. But I don’t think my idea is likely to be the best approach, and I’d be interested if anyone else has an alternative, which might well prove to be better. So, please think about this problem, and if you have suggestions of your own, please send them to me at stuart.coles1111@gmail.com. I’ll include discussion of any ideas I receive in the subsequent post.

# Rare stickers

In an earlier post we looked at the number of packets of Panini stickers you’d be likely to need in order to complete an album. In that post I made the throwaway and obvious remark that if some stickers were rarer than others, you’d be likely to need even more packs than the standard calculations suggest. But is there any evidence that some stickers are rarer than others?

The official Panini response is given here. In reply to the question “Does Panini deliberately print limited edition or rare stickers?”, the company answer is:

No. Every collection consists of a number of stickers printed on one or two print sheets, which in turn contain one of each sticker and are printed in the same quantity as the albums.

So that seems clear enough. But the experience of collectors is rather different, with many suggesting that some stickers are much harder to find than others. So, what’s the data and what type of statistical methods can be used to judge the evidence?

The first thing to say is that, as we saw in the previous post, the nature of collecting this way can be a frustrating experience. So that even if the average number of packs needed was less than 1000, we saw that some collectors are likely to need more than 2000, even assuming all stickers are equally frequent. To put this into perspective, suppose you’ve already collected 662 of the 682 stickers, and need just another 20 stickers to complete the album. By the same type of calculation that we made before, the expected number of further packs needed to complete the album is 491, which is well over half the expected total number of packs needed (which was 969). You might like to try that calculation yourself, using the same method as in the previous post.

In other words, once you reach the point of needing just 20 more stickers, you’re not even half-way through your collection task in terms of the number of packs you’re likely to need. This can make it seem like certain stickers – the 20 you are missing – are rarer than others, even if they’re not.

But that’s not to say that rare stickers don’t exist – it’s just that the fact you’re like to have to wait so long to get the remaining few stickers might make you feel like they are rare. So again: do rare stickers really exist?

Well, here’s a site that tried to conduct the experiment by buying 1000 packs from the 2018 world cup sticker series, listing every single sticker they got and the number of times they got it. Since each pack contains 5 stickers, this means that they collected a total of 5000 stickers. They then used the proportion of times they got each sticker as an estimate of the probability of getting each sticker. For example, they were fortunate enough to get the Belgium team photo sticker 18 times, so they estimated the probability of getting such a sticker as

$18/5000 \approx 1/278$

Using this method, they were able to calculate which team collectively were the easiest and most difficult to complete. The results are summarised in the following figure:

On this basis, stickers of players from Senegal and Columbia were easy to obtain, while those of players from Belgium and Saudi Arabia were much harder. So although the Belgium team photo was one of the most frequent stickers in their collection, the individual players from Belgium were among the least frequent.

Now, you won’t need me to tell you that this is pretty much a waste of time. With just 5000 stickers collected, it’s bound to be that some stickers occur more often than others, and we don’t learn anything this way about whether stickers of certain teams or players are genuinely more difficult to find than others. One could try to ascertain whether the pattern of results here is consistent with an equal distribution of stickers, but there’s no mention of such an analysis being done, and a sample of just 5000 stickers would probably be too small to reach definitive conclusions anyway.

One fun thing though: with the 5000 stickers collected, these guys managed to complete the album except for one missing sticker: Radja Nainggolan of Belgium.  But he ended up not being selected for the World Cup anyway 😀.

This doesn’t really bring us any closer to the question of whether rare stickers exist or not. One interesting suggestion is to look at frequencies of requests in sites that handle markets for second-hand stickers. And indeed, this site finds jumps in frequencies of requests for certain types of stickers, suggesting such types are rarer than others. However, it’s not necessarily the case that the rare stickers are the ones with the most requests: Lionel Messi might be a popular trade, because… Messi… and not because his sticker is rare. Still, the post is a fun read, with complete details about how you might approach this type of analysis.

Finally, far be it for me to promote the exploits of one of our competitors, but some of the guys at ATASS pooled their world cup collections to address exactly this issue. A complete description of their findings can be found here. In summary, from an analysis of nearly 11,000 stickers, they found evidence that shiny stickers – of which there were 50 in the 2018 World Cup series – are much rarer than standard stickers.

Moreover, the strength of the evidence is completely overwhelming. This is partly because the number of stickers collected is large – 11,000 rather than 5,000 in the study I mentioned above – but also because there are 50 shiny stickers, and all of them occurred with a lower frequency than an average sticker. In fact, overall, shiny stickers occurred at a rate that’s around half that of a normal sticker. Now, if a single sticker isn’t found, it’s reasonable to put that down to chance; but it’s beyond the realms of plausibility that 50 stickers of a certain kind all occurred at below the average rate.

On this basis, Panini’s claim that all stickers are equally likely is completely implausible. The only way it could really hold true, assuming the ATASS analysis to be correct, is if there were variations in the physical distribution of stickers, either geographically or temporally. So, although Panini might produce all stickers in equal numbers, variations in distribution might mean that some stickers were harder to get at certain times in certain places.

That seems unlikely though, and the evidence does seem to point to the fact that shiny stickers in the World Cup 2018 series were indeed harder to find than the others.

In summary: yes, rare stickers exist.

# Here’s to all the money at the end of the world

I made the point in last week’s Valentine’s Day post, that although the emphasis of this blog is about the methodology of using Statistics to understand the world through the analysis of data, it’s often the case that statistics in themselves tell their own story. In this way we learnt that a good proportion of the population of the UK buy their pets presents for Valentine’s Day.

As if that wasn’t bad enough, I now have to report to you the statistical evidence for the fact that nature itself is dying. Or as the Guardian puts it:

Plummeting insect numbers `threaten collapse of nature’

The statistical and scientific evidence now points to the fact that, at current rates of decline, all insects could be extinct by the end of the century. Admittedly, it’s probably not great science or statistics to extrapolate the current annual loss of 2.5% in that way, but nevertheless it gives you a picture of the way things are going. This projected elimination of insects would be, by some definitions, the sixth mass extinction event on earth. (Earlier versions wiped out dinosaurs and so on).

And before you go all Donald Trump, and say ‘bring it on: mosquito-free holidays’, you need to remember that life on earth is a complex ecological system in which the big things (including humans) are indirectly dependent on the little things (including insects) via complex bio-mechanisms for mutual survival. So if all the insects go, all the humans go too. And this is by the end of the century, remember.

Here’s First Dog on the Moon’s take on it:

So, yeah, let’s do our best to make money for our clients. But let’s also not forget that money only has value if we have a world to spend it in, and use Statistics and all other means at our disposal to fight for the survival of our planet and all the species that live on it.

# Famous statisticians: Sir Francis Galton

This is the second in a so-far very short series on famous statisticians from history. You may remember that the first in the series was on John Tukey. As I said at that time, rather than just include statisticians randomly in this series, I’m going to focus on those who have had an impact beyond the realm of just statistics.

With that in mind, this post is about Sir Francis Galton (1822-1911), an English statistician who did most of his work in the second half of the 19th century, around the time that Statistics was being born as a viable scientific discipline.

You may remember seeing Galton’s name recently. In a recent post on the bean machine, I mentioned that the device also goes under the name of ‘Galton board’. This is because Galton was the inventor of the machine, which he used to illustrate the Central Limit Theorem, as discussed in the earlier post. You may also remember an earlier post in which I discussed `regression to the mean’; Galton was also the first person to explore and describe this phenomenon, as well as the more general concept of correlation to describe the extent to which two random phenomena are connected.

It’s probably no coincidence that Galton was a half-cousin of Charles Darwin, since much of Galton’s pioneering work was on the way statistics could be used to understand genetic inheritance and human evolution. Indeed, he is the inventor of the term eugenics, which he coined during his attempts to understand the extent to which intelligence is inherited, rather than developed.

Galton is described in Wikipedia as:

• A statistician
• A progressive
• A polymath
• A sociologist
• A psychologist
• An anthropologist
• A eugenicist
• A tropical explorer
• A geographer
• An inventor
• A meteorologist
• A proto-geneticist
• A psychometrician

And you thought you were busy. Anyway, it’s fair to say that Galton falls in my category of statisticians who have done something interesting with their lives outside of Statistics.

His various contributions apart from those mentioned above include:

1. He invented the use of weather maps for popular use;
2. He wrote a book ‘The Art of Travel’ which offered practical travel advice to Victorians;
3. He was the first to propose the use of questionnaires as a means of data collection;
4. He conceived the notion of standard deviation as a way of summarising the variation in data;
5. He devised a technique called composite portraiture which was an early version of photoshop for making montages of photographic portraits;
6. He pretty much invented the technique of fingerprinting for identifying  individuals by their fingerprints.

In summary, many of the things Galton worked on or invented are still relevant today. And this is just as true for his non-statistical contributions, as for his statistical ones. Of course, it’s an unfortunate historical footnote that his theory of eugenics – social engineering to improve biological characteristics in populations – was adopted and pushed to extremes in Nazi Germany, with unthinkable consequences.

In retrospect, it’s a pity he didn’t just stop once he’d invented the bean machine.

# Happy Valentine’s Day

Happy Valentine’s Day. In case you didn’t get any cards or gifts today, please know that Smartodds loves Statistics loves you.

Anyway, I thought it might be interesting to research some statistics about Valentine’s day, and found this article, from which I learned much more about the population of Britain than I was expecting to.

Here are some of the highlights:

1. A significant number of people spend money for Valentine’s day on their pets. This number varies per generation, and is as high as 8.7% for millennials.
2. A slightly smaller, but still significant,  number of people spend money on themselves for Valentine’s. Again, this trend is most prevalent among millennials, and also more common for women than men.
3. 36.2% of people get unwanted gifts most years.
4. 19% of people plan to celebrate Valentine’s late in order to save money by buying cards and gifts once the prices have dropped.

I’m not sure which of these statistics I found to be the more shocking.

Most of the posts in this blog are about the way Statistics as a science can be used to investigate problems and interpret data. But sometimes, the statistics are fascinating in themselves, and don’t require any kind of mathematical sophistication to reveal the secrets that they contain.

Anyway, I have to run now to buy myself my girlfriend a gift

Happy Valentine’s…

# Dance, dance, dance…

Ever thought: ‘I’m pretty sure I would fully understand Statistics, if only a modern dance company would illustrate the techniques for me’?

I hope you get the idea of what I’m trying to do with this blog by now. Fundamentally, Statistics is a very intuitive subject, but that intuition is often masked by technicalities, so that from the outside the subject can seem both boring and impenetrable. The aim of all of my posts is to try to show that neither of those things is true: Statistics is both fascinating and easily understandable. And in this way, whatever your connection to Smartodds, you’ll be better equipped to understand the statistical side of the company’s operations.

Of course, I’m not the only person to try to de-mystify Statistics, and there are many books, blogs and other learning aids with similar aims.

With this in mind, I recently came across a rather unusual set of resources for learning Statistics: a series of dance videos whose aim is to explain statistical concepts through movement. Probably my ‘favourite’ is this one, which deals with the notions of sampling and standard error. You might like to take a look…

I think it fair to say that the comments on these videos on YouTube are mixed. One person wrote:

This way it makes complicated things look simpler. Very informative and useful. Loved it. 🙂

While another said:

this makes simple things look complicated but thanks anyway

So, I guess it depends on your perspective. I think I’m on the side of the latter commenter though: I’m pretty sure that in 5 minutes I could give a much clearer and more entertaining explanation of the issues this film is trying to address than the film does itself. But maybe that’s not the point. Perhaps the point is that different things hook different people in, and while personally I can’t think of a much more complicated way of thinking about issues of sampling and measuring accuracy, the dance perspective seems to work for some people.

Anyway, if you think this might be the key to help you unlock some of the mysteries of Statistics, you can find the full series of four videos here, covering topics like correlation and standard deviation. Enjoy.