Here’s a fun probability problem that Benoit.Jottreau@smartodds.co.uk showed me. If you’re clever at probability, you might be able to solve it exactly; otherwise it’s easy to simulate. But as with previous problems of this type, I think it’s more interesting to find out what you would guess the answer to be, without thinking about it too deeply.

So, suppose you’ve got 10 coins. They’re fair coins, in the sense that if you toss any of them, they’re equally likely to come up heads or tails. You toss all 10 coins. You then remove the ones that come up heads. The remaining ones – the ones that come up tails – you toss again in a second round. Again, you remove any that come up heads, and toss again the ones that come up tails in a third round. And so on. In each round, you remove the coins that come up heads, and toss again the coins that come up tails. You stop once all of the coins have been removed.

The question: on average, how many rounds of this game do you need before all of the coins have been removed?

There are different mathematical ways of approaching this problem, but I’m not really interested in those. I’m interested in how good we are, collectively, at using our instincts to guess the solution to a problem of this type. So, I’d really appreciate it if you’d send me your best guess.

Actually, let’s make it a little more interesting. Can you send me an answer to a second question as well?

Second question: same game as above, but starting with 100 coins. This time, on average, how many rounds do you need before all of the coins have been removed?

Please send your answers to me directly or via this survey form.

I’ll discuss the answers you (hopefully) send me, and the problems themselves in more detail, in a subsequent post.

Please don’t fill out the survey if you solved the problem either mathematically or by simulation, though if you’d like to send me your solutions in either of those cases, I’d be very happy to look at them and discuss them with you.

Needles, noodles and 𝜋

A while back, on Pi Day, I sent a post celebrating the number 𝜋 and mentioned that though 𝜋 is best known for its properties in connection with the geometry of a circle, it actually crops up all over the place in mathematics, including Statistics.

Here’s one famous example…

Consider a table covered with parallel lines like in the following figure.

For argument’s sake, let’s suppose the lines are 10 cm apart. Then take a bunch of needles – or matches, or something similar – that are 5 cm in length, drop them randomly onto the table, and count how many intersect one of the lines on the table.  Let’s suppose there are N needles and m of them intersect one of the lines. It turns out that N/m will be approximately 𝜋, and that the approximation is likely to improve if we repeat the experiment with a bigger value of N.

What this means in practice is that we have a statistical way of calculating 𝜋. Just do the experiment described above, and as we get through more and more needles, so the calculation of N/m is likely to lead to a better and better approximation of 𝜋.

There are various apps and so on that replicate this experiment via computer simulation, including this one, which is pretty nice. The needles which intersect any of the lines are shown in red; the others remain blue. The ratio N/m is shown in real-time, and if you’re patient enough it should get closer to the true value of 𝜋, the longer you wait. The approximation is also shown geometrically – the ratio N/m is very close to the ratio of a circle’s circumference to its diameter.

One important point though: the longer you wait, the greater will be the tendency for the approximation N/m to improve. However,  because of random variation in individual samples, it’s not guaranteed to always improve. For a while, the approximation might get a little worse, before inevitably (but perhaps slowly) starting to improve again.

In actual fact, there’s no need for the needles in this experiment to be half the distance between the lines. Suppose the ratio between the line separation and the needle length is r, then 𝜋 is approximated by

$\hat{\pi} = \frac{2rN}{m}$

In the simpler version above, r=1/2, which leads to the above result

$\hat{\pi} = \frac{N}{m}$

Now, although Buffon’s needle provides a completely foolproof statistical method of calculating 𝜋, it’s a very slow procedure. You’re likely to need very many needles to calculate 𝜋 to any reasonable level of accuracy. (You’re likely to have noticed this if you looked at the app mentioned above). And this is true of many statistical simulation procedures: the natural randomness in experimental data means that very large samples may be needed to get accurate results. Moreover, every time you repeat the experiment, you’re likely to get a different answer, at least to some level of accuracy.

Anyway… Buffon’s needle takes its name from Georges-Louis Leclerc, Comte de Buffon, a French mathematician in the 18th century who first posed the question of what the probability would be for a needle thrown at random to intersect a line. And Buffon’s needle is a pretty well-known problem in probability and Statistics.

Less well-known, and even more remarkable, is Buffon’s noodle problem. Suppose the needles in Buffon’s needle problem are allowed to be curved. So rather than needles, they are noodles(!) We drop N noodles – of possibly different shapes, but still 5 cm in length – onto the table, and count the total number of times the noodles cross a line on the table. Because of the curvature of the noodles, it’s now possible that a single noodle crosses a line more than once, so m is now the total number of line crossings, where the contribution from any one noodle might be 2 or more. Remarkably, it turns out that despite the curvature of the noodles and despite the fact that individual noodles might have multiple line crossings, the ratio N/m still provides an approximation to 𝜋 in exactly the same way it did for the needles.

This result for Buffon’s noodle follows directly from that of Buffon’s needle. You might like to try to think about why that is so. If not, you can find an explanation here.

Finally, a while back, I sent a post about Mendelian genetics. In it I discussed how Mendel used a statistical analysis of pea experiments to develop his theory of genetic inheritance. I pointed out, though, that while the theory is undoubtedly correct, Mendel’s statistical results were almost certainly too good to be true. In other words, he’d fixed his results to get the experimental results which supported his theory. Well, there’s a similar story connected to Buffon’s needle.

In 1901, an Italian mathematician, Mario Lazzarini, carried out Buffon’s needle experiment with a ratio of r=5/6. This seems like a strangely arbitrary choice. But as explained in Wikipedia, it’s a choice which enables the approximation of 355/113, which is well-known to be an extremely accurate fractional approximation for 𝜋. What’s required to get this result is that in a multiple of 213 needle throws, the same multiple of 113 needles intersect a line. In other words, 113 intersections when throwing 213 needles. Or 226 when throwing 426. And so on.

So, one explanation for Lazzarini’s remarkably accurate result is that he simply kept repeating the experiment in multiples of 213 throws until he got the answer he wanted, and then stopped. Indeed, he reported a value of N=3408, which happens to be 16 times 213. And in those 3408 throws, he reportedly got 1808 line intersections, which happens to be 16 times 113.

An alternative explanation is that Lazzarini didn’t do the experiment at all, but pretended he did with the numbers chosen as above so as to force the result to be the value that he actually wanted it to be. I know that doesn’t seem like a very Italian kind of thing to do, but there is some circumstantial evidence that supports this possibility. First, as also explained in Wikipedia:

A statistical analysis of intermediate results he reported for fewer tosses leads to a very low probability of achieving such close agreement to the expected value all through the experiment.

Second, Lazzarini reportedly described a physical machine that he used to carry out the experimental needle throwing. However, a basic study of the design of this machine shows it to be impossible from an engineering point of view.

So, like Mendel, it’s rather likely that Lazzarini invented some data from a statistical experiment just to get the answer that he was hoping to achieve. And the moral of the story? If you’re going to make evidence up to ‘prove’ your answer, build a little bit of statistical error into the answer itself, otherwise you might find statisticians in 100 years’ time proving (beyond reasonable doubt) you cheated.

Britain’s Favourite Crisps

As I’ve mentioned before, my aim in this blog is to raise awareness and understanding of statistical concepts and procedures, particularly with regard to potential applications in sports modelling. Often this will involved discussing particular techniques and methodologies. But sometimes it might involve simply referencing the way statistics has been used to address some particular important topic of the day.

With this latter point in mind, Channel 5 recently showed a program titled ‘Britain’s Favourite Crisps’ in which they revealed the results of a survey investigating, well, Britain’s favourite crisps. Now, if your cultural roots are not based in the UK, the complexities of crisp preference might seem as strange as the current wrangling over Brexit. But those of you who grew up in the UK are likely to be aware of the sensitivities of this issue. Crisp preferences, that is. Let’s not get started on Brexit.

A summary of the results of the survey are contained in the following diagram:

And a complete ranking of the top 20 is included here.

As you might expect for such a contentious issue, the programme generated a lot of controversy. For example:

And so on.

Personally, I’m mildly upset – I won’t say outraged exactly – at Monster Munch appearing only in the Mid-Tier. But let me try to put my own biases aside and turn to some statistical issues. These results are based on some kind of statistical survey, but this raises a number of questions. For example:

1. How many people were included in the survey?
2. How were they interviewed? Telephone? Internet? Person-to-person?
3. How were they selected? Completely randomly? Or balanced to reflect certain demographics? Something else?
4. What were they asked? Just their favourite? Or a ranking of their top 20 say?
5. Were participants given a list of crisps to choose from, or were they given complete freedom of choice?
6. Is it fair to consider Walkers or Pringles as single categories, when they cover many different flavours, while other crisps, such as Quavers, have just a single variety?
7. How were results calculated? Just straight averages based on sample results, or weighted to correct demographic imbalances in the survey sample?
8. How was the issue of non-respondents handled?
9. How certain can we be that the presented results are representative of the wider population?
10. Is a triangle appropriate for representing the results? It suggests the items in each row are equivalent. Was that intended? If so, is it justified by the results?

It may be that some of these questions are answered in the programme itself. Unfortunately, living outside the UK, I can’t access the programme, but those of you based in the UK can, at least for some time, here. So, if you are able to watch it and get answers to any of the questions, please post them in the comments section. But my guess is that most of the questions will remain unanswered.

So, what’s the point? Well, statistical analyses of any type require careful design and analysis. Decisions have to be made about the design and execution of an experiment, and these are likely to influence the eventual results. Consequently, the analysis itself should also take into account the way the experiment was designed, and attempt to correct for potential imbalances. Moreover, a proper understanding of the results of a statistical analysis require detailed knowledge of all aspects of the analysis, from design to analysis.

And the message is, never take results of a statistical analysis on trust. Ask questions. Query the design. Ask where the data came from. Check the methodology. Challenge the results. Ask about accuracy. Question whether the results have been presented fairly.

Moreover, remember that Statistics is as much an art as a science. Both the choice of design of an experiment and the randomness in data mean that a different person carrying out the same analysis is likely to get different results.

And all of this is as true for sports modelling as it is for the ranking of Britain’s favourite crisps.

The Datasaurus Dataset

Look at the data in this table. There are 2 rows of data labelled g1 and g2.  I won’t, for the moment, tell you where the data come from, except that the data are in pairs. So, each column of the table represents a pair of observations: (2, 1) is the first pair, (3, 5) is the second pair and so on. Just looking at the data, what would you conclude?

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

Maybe you’re better at this stuff than me, but I wouldn’t find this an easy question to answer. Even though there are just 10 observations, and each observation contains just a pair of values, I find it difficult to simply look at the numbers and see any kind of pattern at all, either in the individual rows of numbers, or in any possible relationship between the two. And if it’s difficult in this situation, it’s bound to be much more difficult when there might be many thousands or millions of observations, and each observation might not be just a pair, but several – perhaps many – numbers.

So, not easy. But it’s a standard statistical requirement: taking a set of observations – in this case pairs – and trying to understand what they might convey about the process they come from. It’s really the beating heart of Statistics: trying to understand structure from data. Yet even with just 10 pairs of observations, the task isn’t straightforward.

To deal with this problem an important aspect of statistic analysis is the  summarisation of data – reducing the information they contain to just a few salient features. Specifically, in this case, reducing the information that’s contained in the 10 pairs of observations to a smaller number of numbers – so-called statistics – that summarise the most relevant aspects of the information that the data contain. The most commonly-used statistics, as you probably know, are:

1. The means: the average values of each of the g1 and g2 sets of values.
2. The standard deviations: measures of spread around the means of each of the g1 and g2 sets of values.
3. The correlation: a measure, on a scale of -1 to 1, of the tendency for the g1 and g2 values to be related to each other.

The mean is well-known. The standard deviation is a measure of how spread out a set of values are: the more dispersed the numbers, the greater the standard deviation. Correlation is maybe less well understood, but provides a measure of the extent to which 2 sets of variables are linked to one another (albeit in a linear sense).

So, rather than trying to identify patterns in a set of 10 pairs of numbers, we reduce the data to their main features:

• g1 mean = 2.4; g2 mean = 1.8
• g1 standard deviation = 0.97; g2 standard deviation = 1.48
• (g1,g2) correlation = 0.22

And from this we can start to build a picture of what the data tell us:

1. The average value of g1 is rather greater – actually 0.6 greater – than the mean of g2, so there is a tendency for the g1 component of a pair to be bigger than the g2 component.
2. The g2 values are more spread out than the g1 values.
3. The positive value of correlation, albeit a value substantially lower than the maximum of 1, suggests that there is a tendency for the g1 and g2 components to be associated: bigger values of g1 tend to imply bigger values of g2.

So now let me tell you what the data are: they are the home and away scores, g1 and g2 respectively, in the latest round of games – matchday 28- in Serie A. So, actually, the summary values make quite good sense: the mean of g1 is greater than the mean of g2, which is consistent with a home advantage effect. And it’s generally accepted that home and away scores tend to be positively correlated. It’s maybe a little surprising that the standard deviation of away goals is greater than that of home goals, but with just 10 games this is very likely just to be a chance occurrence.

Which gives rise to a different issue: we’re unlikely to be interested in the patterns contained in the data from these particular 10 games. It’s much more likely we’re interested in what they might tell us about the pattern of results in a wider set of games –  perhaps Serie A games from any arbitrary matchday.

But that’s a story for another post sometime. The point of this post is that we’re simply not programmed to look at large (or even quite small) datasets and be able to see any patterns or messages they might contain.  Rather, we have to summarise data with just a few meaningful statistics in order to understand and compare them.

But actually, all of the above is just a precursor to what I actually wanted to say in this post. Luigi.Colombo@smartodds.co.uk recently forwarded the following twitter post to the quant team on RocketChat. Press the start arrow to set off the animation.

As explained in the message, every single one of the images in this animation – including the passages from one of the main images to another –  has exactly the same summary statistics. Thats to say, the mean and standard deviation of both the x- and y-values stay the same, as does the correlation between the two sets of values.

So what’s the moral here? Well, as we saw above, reduction of data to simple summary statistics is immensely helpful in getting a basic picture of the structure of data. But: it is a reduction nonetheless, and something is lost. All of the datasets in the the twitter animation have identical summary statistics, yet the data themselves are dramatically different from one image to another.

So, yes, follow my advice above and use summary statistics to understand data better. But be aware that a summary of data is just that, a summary, and infinitely many other datasets will have exactly the same summary statistics. If it’s important to you that your data look more like concentric ellipses than a dinosaur, you’d better not rely on means and standard deviations to tell you so.

Altered images

In a recent post I described the following problem which I encountered while sitting in a dentist waiting room:

Images are randomly selected from a library of images and shown on a screen. After watching the screen for a while, I notice one or more of the images is a repeat showing of an earlier image. How can I use information on the number of images observed and the number of repeats to estimate how many images there are in the entire library?

I had two great replies suggesting solutions to this problem. The first was from Nity.Raj@smartodds.co.uk

Surely the efficient thing to do is to hack the database of images so you just find out how many there are in fact, rather than estimating?

It’s the perfect answer, but I just need to run it past someone with a legal background who’s connected to Smartodds to check it’s compliant with relevant internet communication laws. Can anyone suggest somebody suitable?

The other idea was from Ian.Rutherford@smartbapps.co.uk who suggested this:

I would take the total of all the images seen and divide it by the number of times I spotted the 23 to Leigh Park to give an estimation of the number of different images

You’ll have to read the original post to understand the ’23 to Leigh Park’ bit of this answer, but you can take it as a reference to any one of the images that you’ve seen. So, let’s suppose I’ve seen 100 images, and I’ve seen one particular image that I’m interested in 4 times. Then Ian’s suggestion is to estimate the total number of images as

$100/4=25$

Ian didn’t explain his answer, so I hope I’m not doing him a disservice, but I think the reasoning for this solution is as follows. Suppose the population size is N and I observe v images. Then since the images occur at random, the probability I will see any particular image when a random image is shown is 1/N. So the average, or expected, number of times I will see a particular image in a sequence of v images is v/N. If I end up seeing the image t times, this means I should estimate v/N with t. But rearranging this, it means I estimate N with v/t.

It’s a really smart answer, but I think there are two slight drawbacks.

1. Suppose, in the sequence of 100 images, I’d already seen 26 (or more) different images. In that case I’d know the estimate of 25 was bound to be an under-estimate.
2. This estimate uses information based on the number of repeats of just one image. Clearly, the number of repeats of each of the different images I observe is equally relevant, and it must be wasteful not to use the information they contain as well.

That said, the simplicity and logic of the answer are both extremely appealing.

But before receiving these answers, and actually while waiting at the dentist, I had my own idea. I’m not sure it’s better than Nity’s or Ian’s, and it has its own drawbacks. But it tells a nice story of how methods from one area of Statistics can be relevant for something apparently unrelated.

So, imagine you’re an ecologist and there’s concern that pollution levels have led to a reduction in the number of fish in a lake. To assess this possibility you need to get an estimate of how many fish there are in the lake.  The lake is large and deep, so surface observations are not useful. And you don’t have equipment to make sub-surface measurements.

What are you going to do?

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

One standard statistical approach to this problem is a technique called mark and recapture. There are many variations on this method, some quite sophisticated, but we’ll discuss just the simplest, which works as follows.

A number of fish are caught (unharmed), marked and released back into the lake. Let this number of fish be n, say.

Some time later, a second sample of fish – let’s say of size K – is taken from the lake. We observe that k fish of this second sample have the mark that we applied in the first sample. So k/K is the proportion of fish in the second sample that have been marked. But since this is just a random sample from the lake, we’d expect this proportion to be similar to the proportion of marked fish in the entire lake, which will be n/N.

Expressing this mathematically, we have an approximation

$k/K \approx n/N$

But we can rearrange this to get:

$N \approx nK/k$

In other words, we could use

$\hat{N}= nK/k$

as an estimate for the number of fish, since we’d expect this to be a reasonable approximation to the actual number N.

So, let’s suppose I originally caught, marked and released 100 fish. I subsequently catch a further 50 fish, of which 5 are marked. Then, n=100, K=50, k=5 and so

$\hat{N} = nK/k = 100 \times 50 /5 =1000$

and I’d estimate that the lake contains 1000 fish.

Now, maybe you can see where this is going. Suppose instead of a lake of fish, we have a library of images. This method would allow me to estimate the size of the population of images, just as it does a population of fish. But there’s a slight catch (if you’ll pardon the pun). When I take a sample of fish from a lake, each of the fish in the sample is unique. But when I look at a selection of images at the dentist, some of them may be repeats. So I can’t quite treat my sample of images in exactly the same way as I would a sample of fish. To get round this problem I have to ignore the repeated images within each sample. So, my strategy is this:

1. Observe a number of the images, ignoring any repeats. Call the number of unique images n.
2. Observe a second set of images. Let the number of unique images in this set be K, but keeping count of repeats with the first set. Let’s say the number of repeats with the first sample is k.

The estimate of the population size – for the same reasons as estimating fish population sizes – is then

$\hat{N} = nK/k$.

So, suppose I chose to look at images for 10 minutes. In that period there were 85 images, but 5 of these were repeats. So, n=80. I then watch for another 5 minutes and observe 30 unique images, 4 of which were also observed in the first sample. So, n=80, K=30, m=4 and my estimate of the number of images in the database is

$\hat{N} = nK/k = 80 \times 30 /4 =600$

Is this answer any better than Ian’s? I believe it uses more information available in the data, since it doesn’t focus on just one image. It’s also less likely to give an answer that is inconsistent with the data that I’ve already seen. But it does have drawbacks and limitations:

1. Ignoring the information on repeats within each sample must also be wasteful of relevant information.
2. The distinction between the first sample and second sample is arbitrary, and it might be that different choices lead to different answers.
3. Keeping track of repeats within and across the two samples might be difficult in practice.

In a subsequent post I’ll do a more detailed study of the performance of the two methods. In the meantime, let me summarise what I think are the main points from this discussion:

1. Statistical problems can occur in the most surprising places
2. There’s usually no right or wrong way of tackling a statistical problem. One approach might be best from one point of view, while another is better from a different point of view.
3. Statistics is a very connected subject: a technique that has been developed for one type of problem may be transferable to a completely different type of problem.
4. Simple answers are not always be the best – though sometimes they are – but simplicity is a virtue in itself.

Having said all that, there are various conventional ways of judging the performance of a statistical procedure, and I’ll use some of these to compare my solution with Iain’s in the follow-up post. Meantime, I’d still be happy to receive alternative solutions to the problem, whose performance I can also compare against mine and Ian’s.

Happy Pi day

Oops, I almost missed the chance to wish you a happy Pi day. So, almost belatedly:

Happy Pi day!!!

You probably know that Pi – or more accurately, 𝜋 – is one of the most important numbers in mathematics, occurring in many surprising places.

Most people first come across 𝜋 at school, where you learn that it’s the ratio between the circumference and the diameter of any circle. But 𝜋 also crops up in almost every other area of mathematics as well, including Statistics.  In a future post I’ll give an example of this.

Meantime, why is today Pi day? Well, today is March 14th, or 3/14 if you’re American. And the approximate value of Pi is 3.14. more accurately, here’s the value of 𝜋 to 100 digits:

3.14159 26535 89793 23846 26433 83279 50288 41971 69399 37510 58209 74944 59230 78164 06286 20899 86280 34825 34211 7067

Not enough for you? You can get the first 100,000 digits here.

But that’s just a part of the story. You probably also know that Pi is what’s known as an irrational number, which means that its decimal representation is infinite and non-repeating. And today it was announced that Pi has just been computed to an accuracy of 31.4 trillion decimal digits, beating the previous most accurate computation by nearly 9 trillion digits.

That’s impressive computing power, obviously, but how about simply remembering the digits of 𝜋? Chances are you remembered from school the first three digits of 𝜋: 3, 1, 4. But the current world record for remembering the value of 𝜋 is 70,030 digits, which is held by Suresh Kumar Sharma of India? And almost as impressively, here’s an 11-year old kid who managed 2091 digits.

Like I say, I’ll write about 𝜋’s importance in Statistics in another post.

Meantime, here’s Homer:

Love Island

A while back Harry.Hill@smartodds.co.uk gave a talk to the (then) quant team about trading strategies. The general issue is well-known: traders have to decide when to place a bet. Generally speaking they can place a bet early, when the price – the amount you get if you win the bet – is likely to be reasonably attractive. But in that case the liquidity of the market – the amount of money you can bet against – is likely to be low. Or they can wait until there is greater liquidity, but then the price is likely to be less attractive. So, given the option of a certain bet size at a stated price, should they bet now or wait in the hope of being able to make a bigger bet, albeit at a probably poorer price?

In general this is a difficult problem to tackle, and to make any sort of progress some assumptions have to be made about the way both prices and liquidity are likely to change as kick-off approaches. And Harry was presenting some tentative ideas, and pointing out some relevant research, that might enable us to get a handle on some of these issues.

Anyway, one of the pieces of work Harry referred to is a paper by F. Thomas Bruss, which includes the following type of example. You play a game where you can throw a dice (say) 10 times. Your objective is to throw a 6, at which point you can nominate that as your score, or continue.  But, here’s the catch: you only win if you throw a 6 and it’s the  final 6 in the sequence of 10 throws.

So, suppose you throw a 6 on the 3rd roll; should you stop? How about the 7th roll? Or the 9th? You can maybe see the connection with the trading issue: both problems require us to choose whether to stop or continue, based on an evaluation of the risk of what will subsequently occur.

Fast-forward a few days after Harry’s talk and I was reading Alex Bellos’s column in the Guardian. Alex is a journalist who writes about both football and mathematics (and sometimes both at the same time). His bi-weekly contributions to the Guardian take the form of mathematically-based puzzles. These puzzles are quite varied, covering everything from logic to geometry to arithmetic and so on. And sometimes even Statistics. Anyway, the puzzle I was reading after Harry’s talk is here. If you have time, take a read. Otherwise, here’s a brief summary.

It’s a basic version of Love Island. You have to choose from 3 potential love partners, but you only see them individually and sequentially. You are shown the first potential partner, and can decide to keep them or not. If you keep them, everything stops there. Otherwise you are shown the second potential partner. Again, you have to stick or twist: you can keep them, or you reject and are shown the third possibility. And in that case you are obliged to stick with that option.

In summary: once you stick with someone, that’s the end of the game. But if you reject someone, you can’t go back to them later. The question is: what strategy should you adopt in order to maximise the chances of choosing the person that you would have picked if you had seen all 3 at the same time?

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

As well as giving a clearer description of the problem, Alex’s article also contains a link to his discussion of the solution. But what’s interesting is that it’s another example of an optimal stopping problem: once we’ve seen a new potential partner, and also previous potential partners, we have to make a decision on whether to stop with what we currently have, or risk trying to get an improvement in the future, knowing that we could also end up with something/someone worse. And if we can solve the problem for love partners, we are one step towards solving the problem for traders as well.

The Love Island problem discussed by Alex is actually a special case of The Secretary Problem.  A company needs to hire a secretary and does so by individual interviews. Once they’ve conducted an interview they have to hire or reject that candidate, without the possibility of returning to him/her once rejected. What strategy should they adopt in order to try to get the best candidate? In the Love Island version, there are just 3 candidates; in the more general problem, there can be any number. With 3 choices, and a little bit of patience, you can probably find the solution yourself (or follow the links towards Alex’s discussion of the solution). But how about if you had 1000 possible love partners? (Disclaimer: you don’t).

Actually, there is a remarkably simple solution to this problem whatever the number of options to choose from: whether it’s 3, 1000, 10,000,000 or whatever. Let this number of candidates be N. Then reject all candidates up to the M’th for some value of M, but keep note of the best candidate, C say, from those M options. Then accept the first subsequent candidate who is better than C in subsequent interviews (or the last candidate if none happens to be better).

But how to choose M? Well, even more remarkably, it turns out that if N is reasonably large, the best choice for M is around N/e, where $e \approx 2.718$ is a number that crops up a lot in mathematics. For N=1000 candidates, this means rejecting the first 368 and then choosing the first that is better than the best of those. And one more remarkable thing about this result: the probability that the candidate selected this way is actually the best out of all the available candidates is 1/e, or approximately 37%, regardless of the value of N.

With N=3, the value of N is too small for this approximate calculation of M to be accurate, but if you calculated the solution to the problem – or looked at Alex’s – you’ll see that the solution is precisely of this form, with M=2 and a probability of 50% of picking the best candidate overall.

Anyway, what I really like about all this is the way things that are apparently unconnected – Love Island, choosing secretaries, trading strategies – are fundamentally linked once you formulate things in statistical terms. And even if the solution in one of the areas is too simple to be immediately transferable to another, it might at least provide useful direction.

Mr. Greedy

In parallel to my series of posts on famous statisticians, I also seem to be running a series of posts on characters from the Mr. Men books. Previously we had a post about Mr. Wrong. And now I have to tell you about Mr. Greedy. In doing so, you’ll hopefully learn something about the limitation of Statistics.

It was widely reported in the media last weekend (here, here and here for example) that a recent study had shown that the Mr. Greedy book is as complex a work of literature as various American classics including ‘Of Mice and Men’ and ‘The Grapes of Wrath’, each by John Steinbeck, the latter having won the Pulitzer prize for literature.

To cut a long story short, the authors of the report have developed a method of rating the readability of a book, based essentially on the complexity and phrasing of the words that it uses. They’ve done this by measuring these features for a large number of books, asking people to read the books, measuring how much they understood, and then creating a map from one to the other using standard regression techniques from Statistics. A detailed, though – irony alert! – not very easily readable, description of the analysis is given here.

The end result of this process is a formula which takes the text of a book and converts it into a ‘readability’ score. Mr. Greedy got a score of 4.4, ‘Of Mice and Men’ got 4.5 and ‘The Grapes of Wrath’ got 4.9. The most difficult book in the database was ‘Gulliver’s Travels’, with a score of 13.5. You can check the readability index value – labelled BL for ‘Book Level’ – for any book in the database by using this dialog search box.

So, yes, Mr. Greedy is almost as complex a piece of literature as the Steinbeck classics.

But… there’s a catch, of course. Any statistical analysis is limited to its own terms of reference, which in this case means that comprehension is measured in a strictly literal sense; not comprehension in a narrative sense. In other words, no attempt was made to assess whether readers understood the sum total in a literary sense of what they were reading, just the individual words and sentences. As such, the values 4.4, 4.5 or anything else say nothing about how difficult a book is to read in terms of narrative comprehension. Sure, the words and sentence structure of Mr. Greedy and The Grapes of Wrath are of similar complexity, but having understood the words in both, understanding the full meaning of Mr. Greedy is likely to be an easier task.

Does this have any relevance at all to sports modelling? Admittedly, not much. Except, it’s always important to understand what has, and has not, been included in a sports model. For example, in a football model based only on goals, when using predictions, it’s relevant to consider making adjustments if you are aware that a team has been especially unlucky in previous games (hit the post; marginal offside; etc etc). But if the model itself already included data of this type in its formulation, then it’s likely to be incorrect to make further adjustments, as doing so would be to double-count the effects.

In summary, if you are using a statistical model or analysis, make sure you know what it includes so as to avoid double-counting in sports models, or buying your 2 year-old nephew a Pulitzer prize winning American masterpiece for their birthday.

Ernie is dead, long live Ernie

Oh no, this weekend they killed Ernie

Well, actually, not that one. This one…

No, no, no. That one died some time ago. This one…

But don’t worry, here’s Ernie (mark 5)…

Let me explain…

Ernie (Electronic Random Number Indicator Equipment) is the acronym of the random number generator that is used by the government’s National Savings and Investments (NSI) department for selecting Premium Bond winners each month.

Premium bonds are a form of savings certificates. But instead of receiving a fixed or variable interest rate paid at regular intervals, like most savings accounts, premium bonds are a gamble. Each month a number of bonds from all those in circulation are selected at random and awarded prizes, with values ranging from ￡25 to ￡1,000,000. Overall, the annual interest rate is currently around 1.4%, but with this method most bond holders will receive 0%, while a few will win many times more than the actual bond value of ￡1, up to one million pounds.

So, your initial outlay is safe when you buy a premium bond – you can always cash them in at the price you paid for them – but you are gambling with the interest.

Now, the interesting thing from a statistical point of view is the monthly selection of the winning bonds. Each month there are nearly 3 million winning bonds, most of which win the minimum prize of ￡25, but 2 of which win the maximum of a million pounds. All these winning bonds have to be selected at random. But how?

As you probably know, the National Lottery is based on a single set of numbers that are randomly generated through the physical mechanism of the mixing and selection of numbered balls. But this method of random number generation is completely impractical for the random selection of several million winning bonds each month. So, a method of statistical simulation is required.

In a previous post we already discussed the idea of simulation in a statistical context. In fact, it turns out to be fairly straightforward to generate mathematically a series of numbers that, to all intents and purposes, look random. I’ll discuss this technique in a future post, but the basic idea is that there are certain formulae which, when used recursively, generate a sequence of numbers that are essentially indistinguishable from a series of random numbers.

But here’s the thing: the numbers are not really random at all. If you know the formula and the current value in the sequence, you can calculate exactly the next value in the sequence. And the next one. And so on.

Strictly, a sequence of numbers generated this way is called ‘pseudo-random’, which is a fancy way of saying ‘pretend-random’. They look random, but they’re not. For most statistical purposes, the difference between a sequence that looks random and is genuinely random is unimportant, so this method is used as the basis for simulation procedures. But for the random selection of Premium Bond winners, there are obvious logistic and moral problems in using a sequence of numbers that is actually predictable, even if it looks entirely random.

For this reason, Ernie was invented. Ernie is a random number generator. But to ensure the numbers are genuinely random, it incorporates a genuine physical process whose behaviour is entirely random. A mathematical representation of the state of this physical process then leads to the random numbers.

The very first Ernie is shown in the second picture above. It was first used in 1957, was the size of a van and used a gas neon diode to induce the randomness. Though effective, this version of Ernie was fairly slow, generating just 2000 numbers per hour. It was subsequently killed-off and replaced with ever-more efficient designs over the years.

The third picture above shows Ernie (mark 4), which has been in operation from 2004 up until this weekend. In place of gas diodes, it used thermal noise in transistors to generate the required randomness, which in turn generated the numbers. Clearly, in terms of size, this version was a big improvement on Ernie (mark 1), being about the size of a normal PC. It was also much more efficient, being able to generate one million numbers in an hour.

But Ernie (mark 4) is no more. The final picture above shows Ernie (mark 5), which came into operation this weekend, shown against the tip of a pencil. It’s essentially a microchip. And of course, the evolution of computing equipment the size of a van to the size of a pencil head over the last 60 years or so is a familiar story. Indeed Ernie (mark 5) is considerably faster – by a factor of 42.5 or so – even compared to Ernie (mark 4), despite the size reduction. But what really makes the new version of Ernie stand out is that the physical process that induces the randomness has fundamentally changed. One way or another, all the previous versions used thermal noise to generate the randomness; Ernie (mark 5) uses quantum random variation in light signals.

More information on the evolution of Ernie can be found here. A slightly more technical account of the way thermal noise was used to generate randomness in each of the Ernie’s up to mark 4 is given here. The basis of the quantum technology for Ernie mark 5 is that when a photon is emitted towards a semi-transparent surface, is either reflected or transmitted at random. Converting these outcomes into 0/1 bit values, forms the building block of random number generation.

Incidentally, although the randomness in the physical processes built into Ernie should guarantee that the numbers generated are random, checks on the output are carried out by the Government Actuary’s Department to ensure that the output can genuinely be regarded as random. In fact they apply four tests to the sequence:

1. Frequency: do all digits occur (approximately) equally often?
2. Serial: do all consecutive number pairs occur (approximately) equally often?
3. Poker: do poker combinations (4 identical digits; 3 identical digits; two pairs; one pair; all different) occur as often as they should in consecutive numbers?
4. Correlation: do pairs of digits at different spacings in bond numbers have approximately the correct correlation that would be expected under randomness?

In the 60 or so years that premium Bonds have been in circulation, the monthly numbers generated by each of the successive Ernie’s have never failed to pass these tests.

However:

Finally, in case you’re disappointed that I started this post with a gratuitous reference to Sesame Street which I didn’t follow-up on, here’s a link to 10 facts and statistics about Sesame Street.

The origin of all chavs upon this earth

This is a true story which includes an illustration of how interesting statistical questions can arise in simple everyday life. It’s a bit long though, so I’ll break it down into two posts. In this one, I’ll give you the background information. In a subsequent post, I’ll discuss a possible solution to the problem that arises.

As many of you know, I live in Italy. Actually, in a small town called Belluno in the north-east of Italy, on the edge of the dolomites. It’s great, but like most people’s life journey, my route here hasn’t been straightforward.

I grew up on a dismal overflow council estate called Leigh Park, on the distant outskirts of Portsmouth. Leigh Park was once the largest council estate in Europe and, according to this article, “could well be the origin of all chavs upon this earth”. (Just in case you’re unfamiliar with the term chav, Wikipedia gives this definition: “A pejorative epithet used in the United Kingdom to describe a particular stereotype of anti-social youth dressed in sportswear”. Explains a lot, right?)

Anyway, the other day I had to take my son to the dentist in Belluno for a check-up. The waiting area in the dentist’s has recently been refurbished, and they’ve installed a large-screen TV on the main wall. But instead of showing anything interesting, the TV just seems to flip through random images: pictures of animals; of paintings; of architecture; of cities; of people; of pretty much anything. It’s probably meant to be soothing or distracting while you’re waiting for your teeth to be drilled.

So, I sat down and started looking at this TV. And the first image I saw was of a single-decker bus with destination Leigh Park (!), a little bit like this…

My first thought, obviously, was that this was a remarkably unlikely coincidence: a TV screen in Belluno, Italy, showing the image of a bus heading towards the completely unremarkable housing estate I grew up on in England. But we’ve discussed this kind of issue before: our lives are filled with many random events each day, and we only notice the ones that are coincidences. So though it seems improbable that something like this could occur, it’s much less improbable when you balance it against the many, many unremarkable things in a day which also occur.

But the main theme of the story is something different…

I wanted to point out this coincidence – which connects to part of his own family history – to my son, but by the time I managed to drag his attention away from playing on his phone, the image had changed to something else. Having nothing better to do – my potential company for this visit was just playing on his phone, remember – I kept watching the TV screen. Now, although the images kept changing, I noticed after a while that some of the images had repeated. Not in any systematic order, but apparently at random. So, my best guess is that the screen was showing images from a fixed library of pictures in a random order. As such, the image of the Leigh Park bus would show up again at some point, but the time it would take to show up would depend on the size of the library of images. If there were just a few pictures, it would probably show up again very soon; if there were very many pictures, it would most likely take a long time.

So, here’s the question:

How could I estimate the number of images in the library being shown randomly on the TV?

This seems like a reasonable sort of question to ask. I have some data – a list of images I’ve observed, together with counts of the numbers of repeats I’ve seen. And the more often I see repeats, the smaller I might expect the number of available images to be. But what strategy could I use, either based on the images I’d already observed, or by also observing extra images, to estimate the number of images in the entire library?

I have an idea of how to tackle this problem, and I’ll discuss it in a future post. But I don’t think my idea is likely to be the best approach, and I’d be interested if anyone else has an alternative, which might well prove to be better. So, please think about this problem, and if you have suggestions of your own, please send them to me at stuart.coles1111@gmail.com. I’ll include discussion of any ideas I receive in the subsequent post.