Faking it

 

Take a look at the following table:

[Table: fake_data – land areas of various countries, with two columns of figures]

 

It shows the total land area, in square kilometres, for various countries. Actually, it’s the first part of a longer alphabetical list of all countries and includes two columns of figures, each purporting to be the corresponding area of each country. But one of these columns contains the real areas and the other one is fake. Which is which?

Clearly, if your knowledge of geography is good enough that you know the land area of Belgium – or any of the other countries in the table – or whether Bahrain is bigger than Barbados, then you will know the answer. You could also cheat and check with Google. But you can answer the question, and be almost certain of being correct, without cheating and without knowing anything about geography. Indeed, I could have removed the first column giving the country names, and not even told you that the data correspond to land areas, and you should still have been able to tell me which column is real and which is fake.

So, which column is faking it? And how do you know?

I’ll write a follow-up post giving the answer and explanation sometime soon. Meantime, if you’d like to write to me giving your own version, I’d be happy to hear from you.

 

Freddy’s story: part 2

In a previous post I discussed a problem that Freddy.Teuma@smartodds.co.uk had written to me about. The problem was a simplified version of an issue sent to him by a friend, connected with a genetic algorithm for optimisation. Simply stated: you start with £100. You toss a coin and if it comes up heads you lose 25% of your current money, otherwise you gain 25%. You play this game over and over, always decreasing or increasing your current money by 25% on the basis of a coin toss. The issue is how much money you expect to have, on average, after 1000 rounds of this game.

As I explained in the original post, Freddy’s intuition was that the average should stay the same at each round. So even after 1000 (or more) rounds, you’d have an average of £100. But when Freddy simulated the process, he always got an amount close to £0, and so concluded his intuition must be wrong.

A couple of you wrote to give your own interpretations of this apparent conflict, and I’m really grateful for your participation. As it turns out, Freddy’s intuition was spot on, and his argument was pretty much a perfect mathematical proof. Let me make the argument just a little bit more precise.

Suppose after n rounds the amount of money you have is M. Then after n+1 rounds you will have (3/4)M if you get a Head and (5/4)M if you get a Tail. Since each of these outcomes is equally probable, the average amount of money after n+1 rounds is

\frac{ (3/4)M + (5/4)M}{2}= M

In other words, exactly as Freddy had suggested, the average amount of money doesn’t change from one round to the next. And since I started with £100, this will be the average amount of money after 1 round, 2 rounds and all the way through to 1000 rounds.
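In fact, you can check this numerically without any simulation at all. The following sketch (my own, not part of the original discussion) uses the fact that after n rounds the amount held is 100 × (3/4)^h × (5/4)^(n-h), where h, the number of Heads, has a Binomial(n, 1/2) distribution:

# Exact average after n rounds, by summing over the possible numbers of Heads.
exact_mean <- function(n, start = 100){
  h <- 0:n
  outcomes <- start * (3/4)^h * (5/4)^(n - h)   # amount held after h Heads and n-h Tails
  sum(outcomes * dbinom(h, n, 0.5))             # weight each outcome by its probability
}
exact_mean(2)      # 100
exact_mean(1000)   # still 100, up to floating-point error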

But if Freddy’s intuition was correct, then surely the simulations must have been wrong?

Well, no. I checked Freddy’s code – a world first! – and it was perfect. Moreover, my own implementation displayed the same features as Freddy’s, as shown in the previous post: every simulation has the amount of money decreasing to zero long before 1000 rounds have been completed.

So what explains this contradiction between what we can prove theoretically and what we see in practice?

The following picture shows histograms of the money remaining after a certain number of rounds for each of 100,000 simulations. In the previous post I showed the individual graphs of just 16 simulations of the game; here we’re looking at a summary of 100,000 simulated games.

For example, after 2 rounds, there are only 3 possible outcomes: £56.25, £93.75 and £156.25. You might like to check why that should be so. Of these, £93.75 occurred most often in the simulations, while the other two occurred more or less equally often. You might also like to think why that should be so. Anyway, looking at the values, it seems plausible that the average is around £100, and indeed the actual average from the simulations is very close to that value. Not exact, because of random variation, but very close indeed.

After 5 rounds there are more possible outcomes, but you can still easily convince yourself that the average is £100, which it is. But once we get to 10 rounds, it starts to get more difficult. There’s a tendency for most of the simulated runs to give a value that’s less than £100, but there are also a few observations that are quite a bit bigger than £100. Indeed, you can just about see that there are one or more values close to £1000 or so. What’s happening is that the simulated values are becoming much more asymmetric as the number of rounds increases. Most of the results will end up below £100 – though still positive, of course – but a few will end up being much bigger than £100. And the average remains at £100, exactly as the theory says it must.

After 100 rounds, things are becoming much more extreme. Most of the simulated results end up close to zero, but one simulation (in this case) gave a value of around £300,000. And again, once the values are averaged, the answer is very close to £100.

But how does this explain what we saw in the previous post? All of the simulations I showed, and all of those that Freddy looked at, and those his friend obtained, showed the amount of money left being essentially zero after 1000 rounds. Well, the histogram of results after 1000 rounds is a much, much more extreme case of the one shown above for 100 rounds. Almost all of the probability is very, very close to zero. But there’s a very small amount of probability spread out up to an extremely large value indeed, such that the overall average remains £100. So almost every time I do a simulation of the game, the amount of money I have is very, very close to zero. But very, very, very occasionally, I would simulate a game whose result was a huge amount of money, such that it would balance out all of those almost-zero results and give me an answer close to £100. But, such an event is so rare, it might take billions of billions of simulations to get it. And we certainly didn’t get it in the 16 simulated games that I showed in the previous post.

So, there is no contradiction at all between the theory and the simulations. It’s simply that when the number of rounds is very large, the very large results which could occur after 1000 rounds, and which ensure that the average balances out to £100, occur with such low probability that we are unlikely to simulate enough games to see them. We therefore see only the much more frequent games with low winnings, and calculate an average which underestimates the true value of £100.
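A quick simulation sketch (my own, separate from the code given further down) makes the same point: compare the empirical average of many simulated games against the theoretical £100 for different numbers of rounds.

# Empirical average of the final amount over many simulated games.
# For modest numbers of rounds the average lands in the vicinity of £100;
# repeating with n_rounds = 1000 typically gives an average far below £100,
# because the rare enormous outcomes that hold the average up are almost
# never simulated.
set.seed(42)
sim_mean <- function(n_rounds, n_games = 100000, start = 100){
  finals <- replicate(n_games,
                      start * prod(sample(c(0.75, 1.25), n_rounds, replace = TRUE)))
  mean(finals)
}
sapply(c(2, 5, 10, 100), sim_mean)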

There are a number of messages to be drawn from this story:

  1. Statistical problems often arise in the most surprising places.
  2. The strategy of problem simplification, solution through intuition, and verification through experimental results is a very useful one.
  3. Simulation is a great way to test models and hypotheses, but it has to be done with extreme care.
  4. And if there’s disagreement between your intuition and experimental results, it doesn’t necessarily imply either is wrong. It may be that the experimental process has complicated features that make results unreliable, even with a large number of simulations.

Thanks again to Freddy for the original problem and the discussions it led to.


To be really precise, there’s a bit of sleight-of-hand in the mathematical argument above. After the first round my expected – rather than actual – amount of money is £100. What I showed above is that the average money I have after any round is equal to the actual amount of money I have at the start of that round. But that’s not quite the same thing as showing it’s equal to the average amount of money I have at the start of the round.

But there’s a famous result in probability – sometimes called the law of iterated expectations – which lets me replace this actual amount at the start of the second round with the average amount, and the result stays the same. You can skip this if you’re not interested, but let me show you how it works.

At the start of the first round I have £100.

Because of the rules of the game, at the end of this round I’ll have either £75 or £125, each with probability 1/2.

In the first case, after the second round, I’ll end up with either £56.25 or £93.75, each with probability 1/2. And the average of these is £75.

In the second case, after the second round, I’ll end up with either £93.75 or £156.25, each with probability 1/2. And the average of these is £125.

And if I average these averages I get £100. This is the law of iterated expectations at work. I’d get exactly the same answer if I averaged the four possible 2-round outcomes: £56.25, £93.75 (twice) and £156.25.

Check:

\frac{56.25 + 93.75 + 93.75 + 156.25}{4} = 100

So, my average after the second round is equal to the average after the first which was equal to the initial £100.

The same argument also applies at any round: the average is equal to the average of the previous round. Which in turn was equal to the average of the previous round. And so on, telescoping all the way back to the initial value of £100.

So, despite the sleight-of-hand, the result is actually true, and this is precisely what Freddy had hypothesised. As explained above, his only ‘mistake’ was to observe that a small number of simulations suggested a quite different behaviour, and to assume that this meant his mathematical reasoning was wrong.

 

Freddy’s story: part 1

This is a great story with a puzzle and an apparent contradiction at the heart of it, that you might like to think about yourself.

A couple of weeks ago Freddy.Teuma@smartodds.co.uk wrote to me to say that he’d been looking at the recent post which discussed a probability puzzle based on coin tossing, and had come across something similar that he thought might be useful for the blog. Actually, the problem Freddy described was based on an algorithm for optimisation using genetic mutation techniques, that a friend had contacted him about.

To solve the problem, Freddy did four smart things:

  1. He first simplified the problem to make it easier to tackle, while still maintaining its core elements;
  2. He used intuition to predict what the solution would be;
  3. He supported his intuition with mathematical formalism;
  4. He did some simulations to verify that his intuition and mathematical reasoning were correct.

This is exactly how a statistician would approach both this problem and problems of greater complexity.

However… the pattern of results Freddy observed in the simulations contradicted what his intuition and mathematics had suggested would happen, and so he adjusted his beliefs accordingly. And then he wrote to me.

This is the version of the problem that Freddy had simplified from the original…

Suppose you start with a certain amount of money. For argument’s sake, let’s say it’s £100. You then play several rounds of a game. At each round the rules are as follows:

  1. You toss a fair coin (Heads and Tails each have probability 1/2).
  2. If the coin shows Heads, you lose a quarter of your current amount of money and end up with 3/4 of what you had at the start of the round.
  3. If the coin shows Tails, you win a quarter of your current amount of money and end up with 5/4 of what you had at the start of the round.

For example, suppose your first 3 tosses of the coin are Heads, Tails, Heads. The money you hold goes from £100 to £75, then £93.75, then £70.3125.

Now, suppose you play this game for a large number of rounds. Again, for argument’s sake, let’s say it’s 1000 rounds. How much money do you expect to have, on average, at the end of these 1000 rounds?

Have a think about this game yourself, and see what your own intuition suggests before scrolling down.

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

Freddy’s reasoning was as follows. In each round of the game I will lose or gain 25% of my current amount of money with equal probability. So, if I currently have £100, then at the end of the next round I will have either £75 or £125 with equal probability. And the average is still £100. This reasoning is true at each round of the game. And so, after any number of rounds, including 1000, I’d expect to have exactly the same amount of money as when I started: £100.

But when Freddy simulated the process, he found a different sort of behaviour. In each of his simulations, the money held after 1000 rounds was very close to zero, suggesting that the average is much smaller than £100.

I’ve taken the liberty of doing some simulations myself: the pattern of results in 16 repeats of the game, each time up to 1000 rounds,  is shown in the following figure.

Each panel of the figure corresponds to a repeat of the game, and in each repeat I’ve plotted a red trace showing how much money I hold after each round of the game.  In each case you can see that I start with £100, there’s then a bit of oscillation – more in some of the realisations than in others, due to random variation – but in all cases the amount of money I have hits something very close to zero somewhere before 250 rounds and then stays there right up to 1000 rounds.

So, there is indeed a conflict between Freddy’s intuition and the picture that these simulations provide.

What’s going on?

I’ll leave you to think about it for a while, and write with my own explanation and discussion of the problem in a future post. If you’d like to write to me to explain what you think is happening, I’d be very happy to hear from you.

Obviously, I’m especially grateful to Freddy for having sent me the problem in the first place, and for agreeing to let me write a post about it.


Update: if you’d like to run the simulation exercise yourself, just click the ‘run’ button in the following window. This will simulate the game for 1000 rounds, starting with £100. The graph will show you how much money you hold after each round of the game, while if you toggle to the console window it will tell you how much money you have after the 1000th round (to the nearest £0.01). This may not work in all browsers, but seems to work ok in Chrome. You can repeat the experiment simply by clicking ‘Run’ again. You’re likely to get a different graph each time because of the randomness in the simulations. But what about the final amount? Does that also change? And what does it suggest about Freddy’s reasoning that the average amount should stay equal to £100?

game_sim <- function(n_rounds = 1000, money_start = 100){
  require(ggplot2)
  # track the money held round by round
  money <- c()
  money[1] <- money_start
  for(i in 2:n_rounds){
    # each round: lose or gain 25% with equal probability
    money[i] <- money[i - 1] * sample(c(0.75, 1.25), 1)
  }
  m <- data.frame(round = 1:n_rounds, money = money)
  cat('Money in pounds after ', n_rounds, ' rounds is ', round(money[n_rounds], 2))
  ggplot(aes(x = round, y = money), data = m) +
    geom_line(color = 'red') +
    ggtitle('Money')
}
game_sim()

Taking things to extremes

One of the themes I’ve tried to develop in this blog is the connectedness of Statistics. Many things which seem unrelated, turn out to be strongly related at some fundamental level.

Last week I posted the solution to a probability puzzle that I’d posted previously. Several respondents to the puzzle, including Olga.Turetskaya@smartodds.co.uk, included the logic they’d used to get to their answer when writing to me. Like the others, Olga explained that she’d basically halved the number of coins in each round, till getting down to (roughly) a single coin. As I explained in last week’s post, this strategy leads to an answer that is very close to the true answer.

Anyway, Olga followed up her reply with a question: if we repeated the coin tossing puzzle many, many times, and plotted a histogram of the results – a graph which shows the frequencies of the numbers of rounds needed in each repetition – would the result be the typical ‘bell-shaped’ graph that we often find in Statistics, with the true average sitting somewhere in the middle?

Now, just to be precise, the bell-shaped curve that Olga was referring to is the so-called Normal distribution curve, that is indeed often found to be appropriate in statistical analyses, and which I discussed in another previous post. To answer Olga, I did a quick simulation of the problem, starting with both 10 and 100 coins. These are the histograms of the results.

So, as you’d expect, the average values (4.726 and 7.983 respectively) do indeed sit nicely inside the respective distributions. But, the distributions don’t look at all bell-shaped – they are heavily skewed to the right. And this means that the averages are closer to the lower end than the top end. But what is it about this example that leads to the distributions not having the usual bell-shape?

Well, the normal distribution often arises when you take averages of something. For example, if we took samples of people and measured their average height, a histogram of the results is likely to have the bell-shaped form. But in my solution to the coin tossing problem, I explained that one way to think about this puzzle is that the number of rounds needed till all coins are removed is the maximum of the number of rounds required by each of the individual coins. For example, if we started with 3 coins, and the number of rounds for each coin to show heads for the first time was 1, 4 and 3 respectively, then I’d have had to play the game for 4 rounds before all of the coins had shown a Head. And it turns out that the shape of distributions you get by taking maxima is different from what you get by taking averages. In particular, it’s not bell-shaped.
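If you’d like to see this for yourself, here’s a small sketch (my own; it uses the fact, as in the coin puzzle, that the number of rounds a single coin needs is Geometric with success probability 1/2):

# The number of rounds for the whole game is the maximum over the coins of the
# number of rounds each coin needs to show its first Head.
# rgeom() counts failures before the first success, so add 1 to count rounds.
set.seed(7)
rounds_needed <- function(n_coins) max(rgeom(n_coins, prob = 0.5) + 1)
sims10  <- replicate(100000, rounds_needed(10))
sims100 <- replicate(100000, rounds_needed(100))
mean(sims10)    # should be close to 4.726
mean(sims100)   # should be close to 7.983
hist(sims10, breaks = seq(0.5, max(sims10) + 0.5, by = 1),
     main = "Rounds needed, 10 coins", xlab = "rounds")

The histograms this produces are heavily right-skewed, just like the ones above.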

But is this ever useful in practice? Well, the Normal bell-shaped curve is somehow the centrepiece of Statistics, because averaging, in one way or another, is fundamental in many physical processes and also in many statistical operations. And in general circumstances, averaging will lead to the Normal bell-shaped curve.

Consider this though. Suppose you have to design a coastal wall to offer protection against sea levels. Do you care what the average sea level will be? Or you have to design a building to withstand the effects of wind. Again, do you care about average winds? Almost certainly not. What you really care about in each case will be extremely large values of the process: high sea-levels in one case; strong winds in the other. So you’ll be looking through your data to find the maximum values – perhaps the maximum per year – and designing your structures to withstand what you think the most likely extreme values of that process will be.

This takes us into an area of statistics called extreme value theory. And just as the Normal distribution is used as a template because it’s mathematically proven to approximate the behaviour of averages, so there are equivalent distributions that apply as templates for maxima. And what we’re seeing in the above graphs – precisely because the data are derived as maxima – are examples of this type. So, we don’t see the Normal bell-shaped curve, but we do see shapes that resemble the templates that are used for modelling things like extreme sea levels or wind speeds.

So, our discussion of techniques for solving a simple probability puzzle with coins, leads us into the field of extreme value statistics and its application to problems of environmental engineering.

But has this got anything to do with sports modelling? Well, the argument about taking the maximum of some process applies equally well if you take the minimum. And, for example, the winner of an athletics race will be the competitor with the fastest – or minimum – race time. Therefore the models that derive from extreme value theory are suitable templates for modelling athletic race times.

So, we moved from coin tossing to statistics for extreme weather conditions to the modelling of race times in athletics, all in a blog post of less than 1000 words.

Everything’s connected and Statistics is a very small world really.

Heads up


I recently posted a problem that had been shown to me by Benoit.Jottreau@smartodds.co.uk. Basically, you have a bunch of coins. You toss them and remove the ones that come up heads. You then repeat this process over and over till all the coins have been removed. The question was, if you start with respectively 10 or 100 coins, how many rounds of this game does it take on average till all the coins have been removed?

I’m really grateful to all of you who considered the problem and sent me a reply. The answers you sent me are summarised in the following graphs.

 

 

The graph on the left shows the counts of the guesses for the number of rounds needed when starting with 10 coins; the one on the right shows the counts when starting with 100 coins. The main features are as follows:

  • Starting with 10 coins, the most popular answer was 4 rounds; with 100 coins the most popular answer was either 7 or 8 rounds.
  • Almost everyone gave whole numbers as their answers. This wasn’t necessary. Even though the result of every experiment has to be a whole number, the average doesn’t. In a similar way, the average number of goals in a football match is around 2.5.
  • The shape of the distribution of answers for the two experiments is much the same: heavily skewed to the right. This makes sense given the nature of the experiment: we can be pretty sure a minimum number of rounds will be needed, but less sure about the maximum. This is reflected in your collective answers.
  • Obviously, with more coins, there’s more uncertainty about the answer, so the spread of values is much greater when starting with 100 coins.

Anyway, I thought the replies were great, and much better than I would have come up with myself if I’d just gone with intuition instead of solving the problem mathematically.

A few people also kindly sent me the logic they used to get to these answers. And it goes like this…

Each coin will come up heads or tails with equal probability. So, the average number of coins that survive each round is half the number of coins that enter that round. This is perfectly correct. So, for example, when starting with 10 coins, the average number of coins remaining after the first round is 5. By the same logic, the average number remaining after the second round is 2.5, after the third round it’s 1.25, and after the fourth round it’s 0.625. So, the first time the average number of coins goes below 1 is after the fourth round, and it’s therefore reasonable to assume 4 is the average number of rounds for all the coins to be removed.

Applying the same logic but starting with 100 coins, it takes 7 rounds for the average number of coins to fall below 1.

With a slight modification to the logic, to always round to whole numbers, you might get slightly different answers: say 5 and 8 instead of 4 and 7. And looking at the answers I received, I guess most respondents applied an argument of this type.
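In code, the basic halving argument looks something like this (a sketch of my own, without the whole-number rounding variation):

# Halve the average number of coins until it drops below 1,
# counting how many halvings that takes.
halving_rounds <- function(n_coins){
  rounds <- 0
  while(n_coins >= 1){
    n_coins <- n_coins / 2
    rounds  <- rounds + 1
  }
  rounds
}
halving_rounds(10)    # 4
halving_rounds(100)   # 7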

This approach is really great, since it shows a good understanding of the main features of the process: 50% of coins dropping out, on average, at each round of the game. And it leads to answers that are actually very informative: knowing that I need 4 rounds before the average number of coins drops below 1 is both useful and very precise in terms of explaining the typical behaviour of this process.

However… you don’t quite get the exact answers for the average number of rounds, which are 4.726 when you start with 10 coins, and 7.983 when you start with 100 coins. But where do these numbers come from, and why doesn’t the simple approach of dividing by 2 until you get below 1 work exactly?

Well, as I wrote above, starting with 10 coins you need 4 rounds before the average number of coins falls below 1. But this is a statement about the average number of coins. The question I actually asked was about the average number of rounds. Now, I don’t want to detract from the quality of the answers you gave me. The logic of successively dividing by 2 till you get below one coin is great, and as I wrote above, it will give an answer that is meaningful in its own right, and likely to be close to the true answer. But, strictly speaking, it’s focussing on the wrong aspect of the problem: the number of coins instead of the number of rounds.

The solution is quite technical. Not exactly rocket science, but still more intricate than is appropriate for this blog. But you might still find it interesting to see the strategy which leads to a solution.

So, start by considering just one of the coins. Its pattern of results (writing H for Heads and T for Tails) will be something like

  • T, T, H; or
  • T, T, T, H; or
  • H

That’s to say, a sequence of T’s followed by H (or just H if we get Heads on the first throw).

But we’ve seen something like this before. Remember the post Get out of jail? We kept rolling a dice until we got the first 6, and then stopped. Well, this is the same sort of experiment, but with a coin. We keep tossing the coin till we get the first Head. Because of the similarity between these experiments, we can apply the same technique to calculate the probabilities for the number of rounds needed to get the first Head for this coin. One round will have probability 1/2, two rounds 1/4, three rounds 1/8 and so on.

Now, looking at the experiment as a whole, we have 10 (or 100) coins, each behaving the same way. And we repeat the experiment until all of the coins have shown heads for the first time. What this means is that the total number of rounds needed is the maximum of the number of rounds for each of the individual coins. It turns out that this gives a simple method for deriving a formula that gives the probabilities of the number of rounds needed for all of the coins to be removed, based on the probabilities for a single coin already calculated above.

So, we now have a formula for the probabilities of the numbers of rounds needed. And there’s a standard way of converting these probabilities into the average. It’s not immediately obvious when you see it, but with a little algebraic simplification it turns out that you can get the answer in fairly simple mathematical form. Starting with n coins – we had n=10 and n=100 – the average number of rounds needed turns out to be

1+\sum_{k=1}^n(-1)^{k-1}{n \choose k} (2^k-1)^{-1}

The \sum bit means do a sum, and the {n \choose k} term is the number of unique combinations of k objects chosen from n. But don’t worry at all about this detail; I’ve simply included it to show that there is a formula which gives the answer.

With 10 coins you can plug n=10 into this expression to get the answer 4.726. With 100 coins there are some difficulties, since the calculation of {n \choose k} with n=100 is numerically unstable for many values of k. But accurate approximations to the solution are available, which don’t suffer the same numerical stability problems, and we get the answer 7.983.
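If you’d like to check those numbers yourself, here’s an equivalent way of doing the calculation (my own rearrangement, not necessarily the approximation referred to above), summing over j the probability that more than j rounds are needed:

# Average number of rounds = sum over j >= 0 of P(more than j rounds needed)
#                          = sum over j >= 0 of 1 - (1 - 2^(-j))^n.
# This avoids the huge alternating binomial terms that are unstable at n = 100.
expected_rounds <- function(n, max_terms = 1000){
  j <- 0:max_terms
  sum(1 - (1 - 2^(-j))^n)
}
expected_rounds(10)    # 4.726...
expected_rounds(100)   # 7.983...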

So, in summary, with a fair bit of mathematics you can get exact answers to the problem I set. But much more importantly, with either good intuition or sensible reasoning you can get answers that are very similar. This latter skill is much more useful in Statistics generally, and it’s fantastic that the set of replies I received showed collective strength in this respect.

Heads dropping


Here’s a fun probability problem that Benoit.Jottreau@smartodds.co.uk showed me. If you’re clever at probability, you might be able to solve it exactly; otherwise it’s easy to simulate. But as with previous problems of this type, I think it’s more interesting to find out what you would guess the answer to be, without thinking about it too deeply.

So, suppose you’ve got 10 coins. They’re fair coins, in the sense that if you toss any of them, they’re equally likely to come up heads or tails. You toss all 10 coins. You then remove the ones that come up heads. The remaining ones – the ones that come up tails – you toss again in a second round. Again, you remove any that come up heads, and toss again the ones that come up tails in a third round. And so on. In each round, you remove the coins that come up heads, and toss again the coins that come up tails. You stop once all of the coins have been removed.

The question: on average, how many rounds of this game do you need before all of the coins have been removed?

There are different mathematical ways of approaching this problem, but I’m not really interested in those. I’m interested in how good we are, collectively, at using our instincts to guess the solution to a problem of this type. So, I’d really appreciate it if you’d send me your best guess.

Actually, let’s make it a little more interesting. Can you send me an answer to a second question as well?

Second question: same game as above, but starting with 100 coins. This time, on average, how many rounds do you need before all of the coins have been removed?

Please send your answers to me directly or via this survey form.

I’ll discuss the answers you (hopefully) send me, and the problems themselves in more detail, in a subsequent post.

Please don’t fill out the survey if you solved the problem either mathematically or by simulation, though if you’d like to send me your solutions in either of those cases, I’d be very happy to look at them and discuss them with you.

 

Needles, noodles and 𝜋

A while back, on Pi Day, I sent a post celebrating the number 𝜋 and mentioned that though 𝜋 is best known for its properties in connection with the geometry of a circle, it actually crops up all over the place in mathematics, including Statistics.

Here’s one famous example…

Consider a table covered with parallel lines like in the following figure.

For argument’s sake, let’s suppose the lines are 10 cm apart. Then take a bunch of needles – or matches, or something similar – that are 5 cm in length, drop them randomly onto the table, and count how many intersect one of the lines on the table. Let’s suppose there are N needles and m of them intersect one of the lines. It turns out that N/m will be approximately 𝜋, and that the approximation is likely to improve if we repeat the experiment with a bigger value of N.

What this means in practice is that we have a statistical way of calculating 𝜋. Just do the experiment described above, and as we get through more and more needles, so the calculation of N/m is likely to lead to a better and better approximation of 𝜋.

There are various apps and so on that replicate this experiment via computer simulation, including this one, which is pretty nice. The needles which intersect any of the lines are shown in red; the others remain blue. The ratio N/m is shown in real-time, and if you’re patient enough it should get closer to the true value of 𝜋, the longer you wait. The approximation is also shown geometrically – the ratio N/m is very close to the ratio of a circle’s circumference to its diameter.

One important point though: the longer you wait, the greater will be the tendency for the approximation N/m to improve. However,  because of random variation in individual samples, it’s not guaranteed to always improve. For a while, the approximation might get a little worse, before inevitably (but perhaps slowly) starting to improve again.

In actual fact, there’s no need for the needles in this experiment to be half the distance between the lines. Suppose the ratio of the needle length to the line separation is r; then 𝜋 is approximated by

\hat{\pi} = \frac{2rN}{m}

In the simpler version above, r=1/2, which leads to the above result

\hat{\pi} = \frac{N}{m}
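If you don’t fancy dropping real matches, here’s a simulation sketch (my own, using the same set-up as above: needles of length 5 and lines 10 apart, so r = 1/2):

# Drop N needles at random onto lines spaced 'spacing' apart. A needle crosses
# a line when the distance from its centre to the nearest line is at most half
# its length times the sine of its (acute) angle to the lines. The angle is
# generated using R's built-in value of pi, which is a little circular, but
# only as a convenience for the simulation.
buffon <- function(N = 1e6, needle = 5, spacing = 10){
  d     <- runif(N, 0, spacing / 2)              # centre-to-nearest-line distance
  theta <- runif(N, 0, pi / 2)                   # acute angle to the lines
  m     <- sum(d <= (needle / 2) * sin(theta))   # number of crossings
  r     <- needle / spacing
  2 * r * N / m                                  # estimate of pi
}
set.seed(314)
buffon()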

Now, although Buffon’s needle provides a completely foolproof statistical method of calculating 𝜋, it’s a very slow procedure. You’re likely to need very many needles to calculate 𝜋 to any reasonable level of accuracy. (You’re likely to have noticed this if you looked at the app mentioned above). And this is true of many statistical simulation procedures: the natural randomness in experimental data means that very large samples may be needed to get accurate results. Moreover, every time you repeat the experiment, you’re likely to get a different answer, at least to some level of accuracy.

Anyway… Buffon’s needle takes its name from Georges-Louis Leclerc, Comte de Buffon, a French mathematician in the 18th century who first posed the question of what the probability would be for a needle thrown at random to intersect a line. And Buffon’s needle is a pretty well-known problem in probability and Statistics.

Less well-known, and even more remarkable, is Buffon’s noodle problem. Suppose the needles in Buffon’s needle problem are allowed to be curved. So rather than needles, they are noodles(!) We drop N noodles – of possibly different shapes, but still 5 cm in length – onto the table, and count the total number of times the noodles cross a line on the table. Because of the curvature of the noodles, it’s now possible that a single noodle crosses a line more than once, so m is now the total number of line crossings, where the contribution from any one noodle might be 2 or more. Remarkably, it turns out that despite the curvature of the noodles and despite the fact that individual noodles might have multiple line crossings, the ratio N/m still provides an approximation to 𝜋 in exactly the same way it did for the needles.

This result for Buffon’s noodle follows directly from that of Buffon’s needle. You might like to try to think about why that is so. If not, you can find an explanation here.


Finally, a while back, I sent a post about Mendelian genetics. In it I discussed how Mendel used a statistical analysis of pea experiments to develop his theory of genetic inheritance. I pointed out, though, that while the theory is undoubtedly correct, Mendel’s statistical results were almost certainly too good to be true. In other words, he’d fixed his results to get the experimental results which supported his theory. Well, there’s a similar story connected to Buffon’s needle.

In 1901, an Italian mathematician, Mario Lazzarini, carried out Buffon’s needle experiment with a ratio of r=5/6. This seems like a strangely arbitrary choice. But as explained in Wikipedia, it’s a choice which enables the approximation of 355/113, which is well-known to be an extremely accurate fractional approximation for 𝜋. What’s required to get this result is that in a multiple of 213 needle throws, the same multiple of 113 needles intersect a line. In other words, 113 intersections when throwing 213 needles. Or 226 when throwing 426. And so on.

So, one explanation for Lazzarini’s remarkably accurate result is that he simply kept repeating the experiment in multiples of 213 throws until he got the answer he wanted, and then stopped. Indeed, he reported a value of N=3408, which happens to be 16 times 213. And in those 3408 throws, he reportedly got 1808 line intersections, which happens to be 16 times 113.
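You can check Lazzarini’s arithmetic in a couple of lines:

# Lazzarini's reported numbers, plugged into the formula with r = 5/6
2 * (5/6) * 3408 / 1808   # 3.141593...
355 / 113                 # the same well-known rational approximation to pi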

An alternative explanation is that Lazzarini didn’t do the experiment at all, but pretended he did with the numbers chosen as above so as to force the result to be the value that he actually wanted it to be. I know that doesn’t seem like a very Italian kind of thing to do, but there is some circumstantial evidence that supports this possibility. First, as also explained in Wikipedia:

A statistical analysis of intermediate results he reported for fewer tosses leads to a very low probability of achieving such close agreement to the expected value all through the experiment.

Second, Lazzarini reportedly described a physical machine that he used to carry out the experimental needle throwing. However, a basic study of the design of this machine shows it to be impossible from an engineering point of view.

So, like Mendel, it’s rather likely that Lazzarini invented some data from a statistical experiment just to get the answer that he was hoping to achieve. And the moral of the story? If you’re going to make evidence up to ‘prove’ your answer, build a little bit of statistical error into the answer itself, otherwise you might find statisticians in 100 years’ time proving (beyond reasonable doubt) you cheated.

Britain’s Favourite Crisps

 

As I’ve mentioned before, my aim in this blog is to raise awareness and understanding of statistical concepts and procedures, particularly with regard to potential applications in sports modelling. Often this will involve discussing particular techniques and methodologies. But sometimes it might involve simply referencing the way statistics has been used to address some particular important topic of the day.

With this latter point in mind, Channel 5 recently showed a program titled ‘Britain’s Favourite Crisps’ in which they revealed the results of a survey investigating, well, Britain’s favourite crisps. Now, if your cultural roots are not based in the UK, the complexities of crisp preference might seem as strange as the current wrangling over Brexit. But those of you who grew up in the UK are likely to be aware of the sensitivities of this issue. Crisp preferences, that is. Let’s not get started on Brexit.

A summary of the results of the survey is contained in the following diagram:

And a complete ranking of the top 20 is included here.

As you might expect for such a contentious issue, the programme generated a lot of controversy. For example:

And so on.

Personally, I’m mildly upset – I won’t say outraged exactly – at Monster Munch appearing only in the Mid-Tier. But let me try to put my own biases aside and turn to some statistical issues. These results are based on some kind of statistical survey, but this raises a number of questions. For example:

  1. How many people were included in the survey?
  2. How were they interviewed? Telephone? Internet? Person-to-person?
  3. How were they selected? Completely randomly? Or balanced to reflect certain demographics? Something else?
  4. What were they asked? Just their favourite? Or a ranking of their top 20 say?
  5. Were participants given a list of crisps to choose from, or were they given complete freedom of choice?
  6. Is it fair to consider Walkers or Pringles as single categories, when they cover many different flavours, while other crisps, such as Quavers, have just a single variety?
  7. How were results calculated? Just straight averages based on sample results, or weighted to correct demographic imbalances in the survey sample?
  8. How was the issue of non-respondents handled?
  9. How certain can we be that the presented results are representative of the wider population?
  10. Is a triangle appropriate for representing the results? It suggests the items in each row are equivalent. Was that intended? If so, is it justified by the results?

It may be that some of these questions are answered in the programme itself. Unfortunately, living outside the UK, I can’t access the programme, but those of you based in the UK can, at least for some time, here. So, if you are able to watch it and get answers to any of the questions, please post them in the comments section. But my guess is that most of the questions will remain unanswered.

So, what’s the point? Well, statistical analyses of any type require careful design and analysis. Decisions have to be made about the design and execution of an experiment, and these are likely to influence the eventual results. Consequently, the analysis itself should also take into account the way the experiment was designed, and attempt to correct for potential imbalances. Moreover, a proper understanding of the results of a statistical analysis requires detailed knowledge of all aspects of the study, from design through to analysis.

And the message is, never take results of a statistical analysis on trust. Ask questions. Query the design. Ask where the data came from. Check the methodology. Challenge the results. Ask about accuracy. Question whether the results have been presented fairly.

Moreover, remember that Statistics is as much an art as a science. Both the choice of design of an experiment and the randomness in data mean that a different person carrying out the same analysis is likely to get different results.

And all of this is as true for sports modelling as it is for the ranking of Britain’s favourite crisps.

The Datasaurus Dataset

Look at the data in this table. There are 2 rows of data labelled g1 and g2.  I won’t, for the moment, tell you where the data come from, except that the data are in pairs. So, each column of the table represents a pair of observations: (2, 1) is the first pair, (3, 5) is the second pair and so on. Just looking at the data, what would you conclude?

Scroll down once you’ve thought about this question.

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

Maybe you’re better at this stuff than me, but I wouldn’t find this an easy question to answer. Even though there are just 10 observations, and each observation contains just a pair of values, I find it difficult to simply look at the numbers and see any kind of pattern at all, either in the individual rows of numbers, or in any possible relationship between the two. And if it’s difficult in this situation, it’s bound to be much more difficult when there might be many thousands or millions of observations, and each observation might not be just a pair, but several – perhaps many – numbers.

So, not easy. But it’s a standard statistical requirement: taking a set of observations – in this case pairs – and trying to understand what they might convey about the process they come from. It’s really the beating heart of Statistics: trying to understand structure from data. Yet even with just 10 pairs of observations, the task isn’t straightforward.

To deal with this problem an important aspect of statistical analysis is the summarisation of data – reducing the information they contain to just a few salient features. Specifically, in this case, reducing the information that’s contained in the 10 pairs of observations to a smaller number of numbers – so-called statistics – that summarise the most relevant aspects of the information that the data contain. The most commonly-used statistics, as you probably know, are:

  1. The means: the average values of each of the g1 and g2 sets of values.
  2. The standard deviations: measures of spread around the means of each of the g1 and g2 sets of values.
  3. The correlation: a measure, on a scale of -1 to 1, of the tendency for the g1 and g2 values to be related to each other.

The mean is well-known. The standard deviation is a measure of how spread out a set of values are: the more dispersed the numbers, the greater the standard deviation. Correlation is maybe less well understood, but provides a measure of the extent to which 2 sets of variables are linked to one another (albeit in a linear sense).
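In R these summaries are one-liners. Here’s a minimal sketch, with g1 and g2 standing for the two rows of the table (whose full values aren’t reproduced here):

# Summarise paired observations by their means, standard deviations and correlation.
summarise_pairs <- function(g1, g2){
  c(mean_g1 = mean(g1), mean_g2 = mean(g2),
    sd_g1   = sd(g1),   sd_g2   = sd(g2),
    correlation = cor(g1, g2))
}
# e.g. summarise_pairs(g1, g2), with g1 and g2 the ten pairs from the table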

So, rather than trying to identify patterns in a set of 10 pairs of numbers, we reduce the data to their main features:

  • g1 mean = 2.4; g2 mean = 1.8
  • g1 standard deviation = 0.97; g2 standard deviation = 1.48
  • (g1,g2) correlation = 0.22

And from this we can start to build a picture of what the data tell us:

  1. The average value of g1 is rather greater – actually 0.6 greater – than the mean of g2, so there is a tendency for the g1 component of a pair to be bigger than the g2 component.
  2. The g2 values are more spread out than the g1 values.
  3. The positive value of correlation, albeit a value substantially lower than the maximum of 1, suggests that there is a tendency for the g1 and g2 components to be associated: bigger values of g1 tend to imply bigger values of g2.

So now let me tell you what the data are: they are the home and away scores, g1 and g2 respectively, in the latest round of games – matchday 28 – in Serie A. So, actually, the summary values make quite good sense: the mean of g1 is greater than the mean of g2, which is consistent with a home advantage effect. And it’s generally accepted that home and away scores tend to be positively correlated. It’s maybe a little surprising that the standard deviation of away goals is greater than that of home goals, but with just 10 games this is very likely just to be a chance occurrence.

Which gives rise to a different issue: we’re unlikely to be interested in the patterns contained in the data from these particular 10 games. It’s much more likely we’re interested in what they might tell us about the pattern of results in a wider set of games –  perhaps Serie A games from any arbitrary matchday.

But that’s a story for another post sometime. The point of this post is that we’re simply not programmed to look at large (or even quite small) datasets and be able to see any patterns or messages they might contain.  Rather, we have to summarise data with just a few meaningful statistics in order to understand and compare them.


But actually, all of the above is just a precursor to what I actually wanted to say in this post. Luigi.Colombo@smartodds.co.uk recently forwarded the following twitter post to the quant team on RocketChat. Press the start arrow to set off the animation.

As explained in the message, every single one of the images in this animation – including the passages from one of the main images to another – has exactly the same summary statistics. That’s to say, the mean and standard deviation of both the x- and y-values stay the same, as does the correlation between the two sets of values.

So what’s the moral here? Well, as we saw above, reduction of data to simple summary statistics is immensely helpful in getting a basic picture of the structure of data. But: it is a reduction nonetheless, and something is lost. All of the datasets in the twitter animation have identical summary statistics, yet the data themselves are dramatically different from one image to another.

So, yes, follow my advice above and use summary statistics to understand data better. But be aware that a summary of data is just that, a summary, and infinitely many other datasets will have exactly the same summary statistics. If it’s important to you that your data look more like concentric ellipses than a dinosaur, you’d better not rely on means and standard deviations to tell you so.

Altered images

In a recent post I described the following problem which I encountered while sitting in a dentist waiting room:

Images are randomly selected from a library of images and shown on a screen. After watching the screen for a while, I notice one or more of the images is a repeat showing of an earlier image. How can I use information on the number of images observed and the number of repeats to estimate how many images there are in the entire library?

I had two great replies suggesting solutions to this problem. The first was from Nity.Raj@smartodds.co.uk

Surely the efficient thing to do is to hack the database of images so you just find out how many there are in fact, rather than estimating?

It’s the perfect answer, but I just need to run it past someone with a legal background who’s connected to Smartodds to check it’s compliant with relevant internet communication laws. Can anyone suggest somebody suitable?

The other idea was from Ian.Rutherford@smartbapps.co.uk who suggested this:

I would take the total of all the images seen and divide it by the number of times I spotted the 23 to Leigh Park to give an estimation of the number of different images

You’ll have to read the original post to understand the ’23 to Leigh Park’ bit of this answer, but you can take it as a reference to any one of the images that you’ve seen. So, let’s suppose I’ve seen 100 images, and I’ve seen one particular image that I’m interested in 4 times. Then Ian’s suggestion is to estimate the total number of images as

100/4=25

Ian didn’t explain his answer, so I hope I’m not doing him a disservice, but I think the reasoning for this solution is as follows. Suppose the population size is N and I observe v images. Then since the images occur at random, the probability I will see any particular image when a random image is shown is 1/N. So the average, or expected, number of times I will see a particular image in a sequence of v images is v/N. If I end up seeing the image t times, this means I should estimate v/N with t. But rearranging this, it means I estimate N with v/t.

It’s a really smart answer, but I think there are two slight drawbacks.

  1. Suppose, in the sequence of 100 images, I’d already seen 26 (or more) different images. In that case I’d know the estimate of 25 was bound to be an under-estimate.
  2. This estimate uses information based on the number of repeats of just one image. Clearly, the number of repeats of each of the different images I observe is equally relevant, and it must be wasteful not to use the information they contain as well.

That said, the simplicity and logic of the answer are both extremely appealing.

But before receiving these answers, and actually while waiting at the dentist, I had my own idea. I’m not sure it’s better than Nity’s or Ian’s, and it has its own drawbacks. But it tells a nice story of how methods from one area of Statistics can be relevant for something apparently unrelated.


So, imagine you’re an ecologist and there’s concern that pollution levels have led to a reduction in the number of fish in a lake. To assess this possibility you need to get an estimate of how many fish there are in the lake.  The lake is large and deep, so surface observations are not useful. And you don’t have equipment to make sub-surface measurements.

What are you going to do?

Have a think about this before scrolling down.

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

One standard statistical approach to this problem is a technique called mark and recapture. There are many variations on this method, some quite sophisticated, but we’ll discuss just the simplest, which works as follows.

A number of fish are caught (unharmed), marked and released back into the lake. Let this number of fish be n, say.

Some time later, a second sample of fish – let’s say of size K – is taken from the lake. We observe that k fish of this second sample have the mark that we applied in the first sample. So k/K is the proportion of fish in the second sample that have been marked. But since this is just a random sample from the lake, we’d expect this proportion to be similar to the proportion of marked fish in the entire lake, which will be n/N.

Expressing this mathematically, we have an approximation

k/K \approx n/N

But we can rearrange this to get:

N \approx nK/k

In other words, we could use

\hat{N}= nK/k

as an estimate for the number of fish, since we’d expect this to be a reasonable approximation to the actual number N.

So, let’s suppose I originally caught, marked and released 100 fish. I subsequently catch a further 50 fish, of which 5 are marked. Then, n=100, K=50, k=5 and so

\hat{N} =  nK/k = 100 \times 50 /5 =1000

and I’d estimate that the lake contains 1000 fish.
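As a sketch, the whole calculation fits in a line of code (the function name is just my own choice):

# Simple mark-and-recapture estimate of population size:
# n fish marked in the first sample; K caught in the second sample, k of them marked.
mark_recapture <- function(n, K, k) n * K / k
mark_recapture(n = 100, K = 50, k = 5)   # 1000 fish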

Now, maybe you can see where this is going. Suppose instead of a lake of fish, we have a library of images. This method would allow me to estimate the size of the population of images, just as it does a population of fish. But there’s a slight catch (if you’ll pardon the pun). When I take a sample of fish from a lake, each of the fish in the sample is unique. But when I look at a selection of images at the dentist, some of them may be repeats. So I can’t quite treat my sample of images in exactly the same way as I would a sample of fish. To get round this problem I have to ignore the repeated images within each sample. So, my strategy is this:

  1. Observe a number of the images, ignoring any repeats. Call the number of unique images n.
  2. Observe a second set of images. Let the number of unique images in this set be K, but keeping count of repeats with the first set. Let’s say the number of repeats with the first sample is k.

The estimate of the population size – for the same reasons as estimating fish population sizes – is then

\hat{N} =  nK/k.

So, suppose I chose to look at images for 10 minutes. In that period there were 85 images, but 5 of these were repeats. So, n=80. I then watched for another 5 minutes and observed 30 unique images, 4 of which were also observed in the first sample. So, n=80, K=30, k=4 and my estimate of the number of images in the database is

\hat{N} =  nK/k = 80 \times 30 /4 =600
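As a rough check of the idea (entirely my own sketch, with an assumed library size), you can simulate random image showings from a library of known size and see how the estimate behaves:

# Simulate the dentist's screen: images drawn at random from a library of known
# size, split into two viewing spells, then estimate the size as n*K/k.
set.seed(123)
library_size <- 500                                  # assumed true size, for the check
first  <- sample(library_size, 85, replace = TRUE)   # images shown in the first spell
second <- sample(library_size, 35, replace = TRUE)   # images shown in the second spell
n <- length(unique(first))                           # unique images in the first spell
K <- length(unique(second))                          # unique images in the second spell
k <- sum(unique(second) %in% unique(first))          # overlap between the two spells
n * K / k                                            # estimate of library_size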

Is this answer any better than Ian’s? I believe it uses more information available in the data, since it doesn’t focus on just one image. It’s also less likely to give an answer that is inconsistent with the data that I’ve already seen. But it does have drawbacks and limitations:

  1. Ignoring the information on repeats within each sample must also be wasteful of relevant information.
  2. The distinction between the first sample and second sample is arbitrary, and it might be that different choices lead to different answers.
  3. Keeping track of repeats within and across the two samples might be difficult in practice.

In a subsequent post I’ll do a more detailed study of the performance of the two methods. In the meantime, let me summarise what I think are the main points from this discussion:

  1. Statistical problems can occur in the most surprising places
  2. There’s usually no right or wrong way of tackling a statistical problem. One approach might be best from one point of view, while another is better from a different point of view.
  3. Statistics is a very connected subject: a technique that has been developed for one type of problem may be transferable to a completely different type of problem.
  4. Simple answers are not always the best – though sometimes they are – but simplicity is a virtue in itself.

Having said all that, there are various conventional ways of judging the performance of a statistical procedure, and I’ll use some of these to compare my solution with Ian’s in the follow-up post. Meantime, I’d still be happy to receive alternative solutions to the problem, whose performance I can also compare against mine and Ian’s.