# Nokia 3310

Whatever happened to the Nokia 3310, and what’s that got to do with sports data?

Many of you will know Rasmus Ankersen from his involvement with both Brentford and Midtjylland. Maybe you’ve also seen this video of a TED talk Rasmus gave a while back, but I’ve only just come across it. I think it’s interesting because there are now plenty of articles, books and – ahem – blogs which emphasise the potential for statistics and data analytics in both sports and gambling. But Rasmus’s talk here goes in the other direction and argues that since data analytics has been proven as a valuable tool to assist gambling on sports, there are lessons that can be learned for leaders of business and industry. The main themes are:

1. In any process where there’s an element of chance, it’s important to recognise that good and bad results are not just a function of good and bad performance, but also of good and bad luck;
2. There are potentially huge gains in trying to identify the aspects of performance that determine either good or bad results, notwithstanding the interference effects of luck.

In other words, businesses, like football teams, have results that are part performance-driven and part luck. Signal and noise, if you like. Rasmus argues that good business, like good football management, is about identifying what it is that determines the signal, while mitigating the noise. And only by adopting this strategy can companies, like Nokia, avoid the type of sudden death that happened to the 3310. Or as Rasmus puts it: “RIP at gadgets graveyard”.

Anyway, Rasmus’s talk is a great watch, partly because of the message it sends about the importance of Statistics to both sport and industry, but also because it includes something about the history of the relationship between Smartodds, Brentford and Midtjylland. Enjoy.

# Who wants to win £194,375?

In an earlier post I included a link to Oscar predictions by film critic Mark Kermode over the years, which included a 100% success rate across all of the main categories in a couple of years. I also recounted his story of how he failed to make a fortune in 1992 by not knowing about accumulator bets.

Well, it’s almost Oscar season, and fabien.mauroy@smartodds.co.uk pointed me to this article, which includes Mark’s personal shortlist for the coming awards. Now, these aren’t the same as predictions: in some years, Mark has listed his own personal favourites as well as what he believes to be the likely winners, and there’s often very little in common. On the other hand, these lists have been produced prior to the nominations, so you’re likely to get better prices on bets now, rather than later. You’ll have to be quick though, as the nominations are announced in a couple of hours.

Anyway, maybe you’d like to sift through Mark’s recommendations, look for hints as to who he thinks the winner is likely to be, and make a bet accordingly. But if you do make a bet based on these lists, here are a few things to take into account:

1. Please remember the difference between an accumulator bet and single bets;
2. Please don’t blame me if you lose.

If Mark subsequently publishes actual predictions for the Oscars, I’ll include a link to those as well.

Update: the nominations have now been announced and are listed here. Comparing the nominations with Mark Kermode’s own list, the number of nominations which appear in Mark’s personal list for each category is as follows:

- Best Picture: 1
- Best Director: 2
- Best Actor: 1
- Best Actress: 2
- Best Supporting Actor: 3
- Best Supporting Actress: 1
- Best Score: 2

In each case except Best Picture, there are 5 nominations and Mark’s list also comprised 5 contenders. For Best Picture, there are 8 nominations, though Mark only provided 5 suggestions.

So, not much overlap. But again, these weren’t intended to be Mark’s predictions. They were his own choices. I’ll aim to update with Mark’s actual predictions if he publishes them.

# The gene genie

One of the most remarkable advances in scientific understanding over the last couple of hundred years has been Mendelian genetics. This theory explains the basics of genetic inheritance, and is named after its discoverer, Gregor Mendel, who developed the model based on observations of the characteristics of peas when cross-bred from different varieties. In his most celebrated experiment, he crossed pure yellow with pure green peas, and obtained a generation consisting of only yellow peas. But in the subsequent generation, when these peas were crossed, he obtained a mixed generation of yellow and green peas. Mendel constructed the theory of genes and alleles to explain this phenomenon, which subsequently became the basis of modern genetic science.

You probably know all this anyway, but if you’re interested and need a quick reminder, here’s a short video giving an outline of the theory.

Mendel’s pea experiment was very simple, but from the model he developed he was able to calculate the proportion of peas of different varieties to be expected in subsequent generations. For example, in the situation described above, the theory suggests that there would be no green peas in the first generation, but around 1/4 of the peas in the second generation would be expected to be green.

Mendel’s theory extends to more complex situations; in particular it allows for the inheritance of multiple characteristics. In the video, for example, the characteristic for peas to be yellow/green is supplemented by their propensity to be round/wrinkled. Mendel’s model leads to predictions of the proportion of peas in each generation when stratified by both these characteristics: round and green, yellow and wrinkled, and so on.

The interesting thing from a statistical point of view is the way Mendel verified his theory. All scientific theories go through the same validation process: first there are some observations; second those observations lead to a theory; and third there is a detailed scrutiny of further observations to ensure that they are consistent with the theory. If they are, then the theory stands, at least until there are subsequent observations which violate the theory, or a better theory is developed to replace the original.

Now, where there is randomness in the observations, the procedure of ensuring that the observations are in agreement with the theory is more complicated. For example, consider the second generation of peas in the experiment above. The theory suggests that, on average, 1/4 of the peas should be green. So if we take 100 peas from the second generation, we’d expect around 25 of them to be green. But that’s different from saying exactly 25 should be green. Is it consistent with the theory if we get 30 green peas? Or 40? At what point do we decide that the experimental evidence is inconsistent with the theory? This is the substance of Statistics.

Actually, the theory of Mendelian inheritance can be expressed entirely in terms of statistical models. There is a specific probability that certain characteristics are passed on from parents to offspring, and this leads to expected proportions of different types in subsequent generations. And expressed this way, we don’t just learn that 1/4 of second generation peas should be green, but also the probability that in a sample of 100 we get 30, 40 or any number of green peas.

And this leads to something extremely interesting: Mendel’s experimental results are simply too good to be true. For example – though I’m actually making the numbers up here – in repeats of the simple pea experiment he almost always got something very close to 25 green peas out of 100. As explained above, the statistics behind Mendelian inheritance do indeed say that he should have got an average of 25 per population. But the same theory also implies that 20 or 35 green peas out of 100 are entirely plausible, and indeed a spread of experimental results between 20 and 35 is to be expected. But each of Mendel’s experiments gave a number very close to 25. Ironically, if these really were the experimental results, they would be in violation of the theory, which expects not just an average of 25, but also an appropriate amount of variation around that figure.
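
To put rough numbers on that spread (a sketch using Python’s standard library; the sample size of 100 is illustrative, as above):

```python
from math import comb

# Under Mendel's model, the number of green peas in a sample of 100
# second-generation peas follows a Binomial(100, 1/4) distribution.
n, p = 100, 0.25

def prob_between(lo, hi):
    """Probability the green count falls between lo and hi inclusive."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))

mean = n * p                     # 25 green peas on average
sd = (n * p * (1 - p)) ** 0.5    # standard deviation, about 4.3
spread = prob_between(20, 35)    # most, but by no means all, of the probability
```

So the same model that predicts an average of 25 also says that counts well away from 25 should occur regularly; repeated experiments all landing very close to 25 would themselves be evidence against the model.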

So, Mendel’s experimental results were actually a primitive example of fake news. But here’s the thing: Mendel’s theory has subsequently been shown to be correct, even if it seems likely that the evidence he presented had been manipulated to strengthen its case. In modern parlance, Mendel focused on making sure his results supported the predicted average, but failed to appreciate that the theory also implied something about the variation in observations. So even if the experimental results were fake news, the theory itself has been shown to be anything but fake.

To be honest, there is some academic debate about whether Mendel cheated or not. As far as I can tell though, this is largely based on the assumption that since he was also a monk and a highly-regarded scientist, cheating would have been out of character. Nobody really denies the fact that the statistics really are simply too good to be true. Of course, in the end, it really is all academic, as the theory has been proven to be correct and is the basis for modern genetic theory. If interested, you can follow the story a little further here.

Incidentally, the fact that statistical models speak about variation as well as about averages is essential to the way they get used in sports modelling. In football, for example, models are generally estimated on the basis of the average number of goals a team is expected to score. But the prediction of match scores as a potential betting aid requires information about the variation in the number of goals around the average value. And though Mendel seems not to have appreciated the point, a statistical model contains information on both averages and variation, and if a model is to be suitable for data, the data will need to be consistent with the model in terms of both aspects.

# Pulp Fiction (Our Esteemed Leader’s cut)

The previous post had a cinematic theme. That got me remembering an offsite a while back where Matthew.Benham@smartbapps.co.uk gave a talk that I think he called ‘Do the Right Thing’, which is the title of a 1989 Spike Lee film. Midway through his talk Matthew gave a premiere screening of his own version of a scene from Pulp Fiction. Unfortunately, I’ve been unable to get hold of a copy of Matthew’s cut, so we’ll just have to make do with the inferior original….

The theme of Matthew’s talk was the importance of always acting in relation to best knowledge, even if it contradicts previous actions taken when different information was available. So, given the knowledge and information you had at the start of a game, you might have bet on team A. But if the game evolves in such a way that a bet on team B becomes positive value, you should do that. Always do the right thing. And the point of the scene from Pulp Fiction? Don’t let pride get in the way of that principle.

These issues will make a great topic for this blog sometime. But this post is about something else…

Dependence is a big issue in Statistics, and we’re likely to return to it in different ways in future posts. Loosely speaking, two events are said to be independent if knowing the outcome of one doesn’t affect the probabilities of the outcomes of the other. For example, it’s usually reasonable to treat the outcomes of two different football matches taking place on the same day as independent. If we know one match finished 3-0, that information is unlikely to affect any judgements we might have about the possible outcomes of a later match. Events that are not independent are said to be dependent: in this case, knowing the outcome of one will affect the probabilities of the outcomes of the other. In tennis matches, for example, the outcome of one set tends to affect the chances of who will win a subsequent set, so set winners are dependent events.

With this in mind, let’s follow up the discussion in the previous 2 posts (here and here) about accumulator bets. By multiplying prices from separate bets together, bookmakers are assuming that the events are independent. But if there were dependence between the events, it’s possible that an accumulator offers a value bet, even if the individual bets are of negative value. This might be part of the reason why Mark Kermode has been successful in several accumulator bets over the years (or would have been if he’d taken his predictions to the bookmaker and actually placed an accumulator bet).

Let me illustrate this with some entirely made-up numbers. Let’s suppose ‘Pulp Fiction (Our Esteemed Leader’s cut)’, is up for a best movie award, and its upstart director, Matthew Benham, has also been nominated for best director. The numbers for single bets on PF and MB are given in the following table. We’ll suppose the bookmakers are accurate in their evaluation of the probabilities, and that they guarantee themselves an expected profit by offering prices that are below the fair prices (see the earlier post).

| Bet | True Probability | Fair Price | Bookmaker Price |
|---|---|---|---|
| Best Movie: PF | 0.4 | 2.5 | 2 |
| Best Director: MB | 0.25 | 4 | 3.5 |

Because the available prices are lower than the fair prices and the probabilities are correct, both individual bets have negative value (-0.2 and -0.125 respectively for a unit stake). The overall price for a PF/MB accumulator bet is 7, which assuming independence is an even poorer value bet, since the expected winnings from a unit stake are

$0.4 \times 0.25 \times 7 -1 = -0.3$

However, suppose voters for the awards tend to have similar preferences across categories, so that if they like a particular movie, there’s an increased chance they’ll also like the director of that movie. In that case, although the table above might be correct, the probability of MB winning the director award if PF (MB cut) is the movie winner is likely to be greater than 0.25. For argument’s sake, let’s suppose it’s 0.5. Then, the expected winnings from a unit stake accumulator bet become

$0.4 \times 0.5 \times 7 -1 = 0.4$

That’s to say, although the individual bets are still both negative value, the accumulator bet is extremely good value. This situation arises because of the implicit assumption of independence in the calculation of accumulator prices. When the assumption is wrong, the true expected winnings will be different from those implied by the bookmaker prices, potentially generating a positive value bet.
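
The two calculations above can be sketched in a few lines of Python, with the same made-up probabilities and prices:

```python
def acca_expected_profit(p_first, p_second_given_first, combined_price, stake=1):
    """Expected profit on a two-leg accumulator: the bet pays out only if
    both legs win, which has probability p_first * p_second_given_first."""
    return p_first * p_second_given_first * combined_price * stake - stake

# Independence: P(MB wins | PF wins) taken to be the marginal 0.25.
independent_ev = acca_expected_profit(0.4, 0.25, 7)   # -0.3: poor value

# Positive dependence: P(MB wins | PF wins) assumed to rise to 0.5.
dependent_ev = acca_expected_profit(0.4, 0.5, 7)      # +0.4: excellent value
```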

Obviously with most accumulator bets – like multiple football results – independence is more realistic, and this discussion is unhelpful. But for speciality bets like the Oscars, or perhaps some political bets where late swings in votes are likely to affect more than one region, there may be considerable value in accumulator bets if available.

If anyone has a copy of Our Esteemed Leader’s cut of the Pulp Fiction scene on a pen-drive somewhere, and would kindly pass it to me, I will happily update this post to include it.

# How to not win £194,375

In the previous post we looked at why bookmakers like punters to make accumulator bets: so long as a gambler is not smart enough to be able to make positive value bets, the bookmaker will make bigger expected profits from accumulator bets than from single bets. Moreover, even for smart bettors, if any of their individual bets are not smart, accumulator bets may also favour the bookmaker.

With all this in mind, here’s a true story…

Mark Kermode is a well-known film critic, who often appears on BBC TV and radio. In the early ’90s he had a regular slot on Danny Baker’s Radio 5 show, discussing recent movie releases and so on. On one particular show early in 1992, chatting to Danny, he said he had a pretty good idea of how most of the important Oscars would be awarded that year. This was actually before the nominations had been made, so bookmaker prices on award winners would have been pretty good. And since Radio 5 was a predominantly sports radio station, Danny suggested Mark make a bet on the basis of his predictions.

Fast-forward a few months to the day after the Oscar awards and Danny asked Mark how his predictions had worked out. Mark explained that he’d bet on five of the major Oscar awards and they’d all won. Danny asked Mark how much he’d won and he replied that he’d won around £120 for a £25 stake. Considering the difficulty in predicting five correct winners, especially before nominations had been made, this didn’t seem like much of a return, and Danny Baker was incredulous. He’d naturally assumed that Mark would have placed an accumulator bet with the total stake of £25, whereas what Mark had actually done was place individual bets of £5 on each of the awards.

Now, I’ve no idea what the actual prices were, but since the bets were placed before the nominations were announced, it’s reasonable to assume that the prices were quite generous. For argument’s sake, let’s suppose the bets on each of the individual awards had a price of 6. Mark then placed a £5 bet on each, so he’d have made a profit of £25 per bet, for an overall profit of £125. Now suppose, instead, he’d made a single accumulator bet on all 5 awards. In this case he’d have made a profit of

$\pounds 25 \times 6 \times 6 \times 6 \times 6 \times 6 -\pounds 25 = \pounds 194,375$

Again, I’ve no idea if these numbers are accurate or not, but you get the picture. Had Mark made the accumulator bet that Danny intended, he’d have made a pretty big profit. As it was, he won enough for a night out with a couple of mates at the cinema, albeit with popcorn included.
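
With those assumed numbers (a price of 6 per award is my illustration, not the actual 1992 odds), the arithmetic of the two strategies looks like this:

```python
price, n_awards = 6, 5

# Five separate £5 singles: each winning bet returns 6 * £5,
# i.e. a £25 profit per bet.
singles_profit = n_awards * (5 * price - 5)   # £125 if all five win

# One £25 five-fold accumulator: the winnings roll onto each successive
# leg, so the effective price is 6 ** 5 = 7776.
acca_profit = 25 * price**n_awards - 25       # £194,375 if all five win
```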

Of course, the risk you take with an accumulator is that it just takes one bet to fail and you lose everything. By placing 5 single bets Mark would still have won £95 if one of his predictions had been wrong, and would even have made a fiver if he got just one prediction correct. But by not accumulating his bets, he also avoided the possibility of winning £194,375 if all 5 bets came in. Which they did!

So, what’s the story here? Though an accumulator is a poor value bet for mug gamblers, it may be an extremely valuable bet for sharp gamblers, and the evidence suggests (see below) that Mark Kermode is sharper than the bookmakers for Oscar predictions.

Is Mark Kermode really sharper than the bookmakers for Oscar predictions? Well, here’s a list of his predictions for the main 6 (not 5) categories for the years 2006-2017. Mark predicted all 6 categories with 100% accuracy twice in twelve years. I guess that these predictions weren’t always made before the nominations, so the prices are unlikely to be as good as in the example described above. But still, the price on a 6-fold accumulator will have been pretty good regardless. And he’d have won twice, in addition to the 1992 episode (and possibly more often in the intervening years for which I don’t have data). Remarkably, he would have won again in 2017 if the award for best movie had gone to La La Land, as was originally declared winner, rather than Moonlight, which was the eventual winner.

Moral: try to find out Mark’s predictions for the 2019 Oscars and don’t make the mistake of betting singles!

And finally, here’s Mark telling the story of not winning something like £194,375 in his own words:

# Bookmakers love accumulators

You probably know about accumulator, or so-called ‘acca’, bets. Rather than betting individually on several different matches, in an accumulator any winnings from a first bet are used as the stake in a second bet.  If either bet loses, you lose, but if both bets win, there’s the potential to make more money than is available from single bets due to the accumulation of the prices. This process can be applied multiple times, with the winnings from several bets carried over as the stake to a subsequent bet, and the total winnings if all bets come in can be substantial. On the downside, it just takes one bet to lose and you win nothing.

Bookmakers love accumulators, and often apply special offers – as you can see in the profile picture above – to encourage gamblers to make such bets. Let’s see why that’s the case.

Consider a tennis match between two equally-matched players. Since the players are equally-matched, it’s reasonable to assume that each has a probability 0.5 of winning. So if a bookmaker were offering fair odds on the winner of this match, he should offer a price of 2 on either player, meaning that if I place a bet of 1 unit I will receive 2 units (including the return of my stake) if I win. This makes the bet fair, in the sense that my expected winnings – the amount I would win on average if the game were repeated many times – are zero. This is because

$(1/2 \times 2) + (1/2 \times 0) -1 = 0$

That’s the sum of the probabilities multiplied by the prices, take away the stake.

The bet is fair in the sense that, if the match were repeated many times, both the gambler and the bookmaker would expect neither to win nor lose. But bookmakers aren’t in the business of being fair; they’re out to make money and will set lower prices to ensure that they have long-run winnings. So instead of offering a price of 2 on either player, they might offer a price of 1.9. In this case, assuming gamblers split their stakes evenly across two players, bookmakers will expect to win the following proportion of the total stake

$1-1/2\times(1/2 \times 1.9) - 1/2\times (1/2 \times 1.9)=0.05$

In other words, bookmakers have a locked-in 5% expected profit. Of course, they might not get 5%. Suppose most of the money is placed on player A, who happens to win. Then, the bookmaker is likely to lose money. But this is unlikely: if the players are evenly matched, the money placed by different gamblers will probably be evenly spread between the two players. And if it’s not, then the bookmakers can adjust their prices to try to encourage more bets on the less-favoured side.

Now, in an accumulator bet, the prices are multiplied. It’s equivalent to taking all of your winnings from a first bet and placing them on a second bet. Then those winnings are placed on the outcome of a third bet, and so on. So if there are two tennis matches, A versus B and C versus D, each of which is evenly-matched, the fair and actual prices on the accumulator outcomes are as follows:

| Accumulator Bet | A-C | A-D | B-C | B-D |
|---|---|---|---|---|
| Fair Price | 4 | 4 | 4 | 4 |
| Actual Price | 3.61 | 3.61 | 3.61 | 3.61 |

The value 3.61 comes from taking the prices of the individual bets, 1.9 in each case, and multiplying them together. It follows that the expected profit for the bookmaker is

$1-4\times 1/4\times(1/4 \times 3.61) = 0.0975$.

So, the bookmaker profit is now expected to be almost 10%. In other words, with a single accumulator, bookmakers almost double their expected profits. With further accumulators, the profits increase further and further. With 3 bets it’s over 14%; with 4 bets it’s around 18.5%. Because of this considerable increase in expected profits with accumulator bets, bookmakers can be ‘generous’ in their offers, as the headline graphic to this post suggests. In actual fact, the offers they are making are peanuts compared to the additional profits they make through gamblers making accumulator bets.
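
The pattern behind these numbers is that each leg multiplies the expected return by 0.95 (probability 0.5 times price 1.9), so a unit-stake n-leg accumulator returns 0.95^n of the stake on average. A quick sketch:

```python
def acca_margin(n_legs, true_prob=0.5, price=1.9):
    """Bookmaker's expected profit per unit stake on an n-leg accumulator
    of identical evenly-matched legs priced below the fair price of 2."""
    return 1 - (true_prob * price) ** n_legs

# Margin by number of legs: 5%, then 9.75%, 14.26%, 18.55%, ...
margins = {n: round(acca_margin(n), 4) for n in (1, 2, 3, 4)}
```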

However… all of this assumes that the bookmaker sets prices accurately. What happens if the gambler is more accurate in identifying the fair price for a bet than the bookmaker? Suppose, for example, a gambler reckons correctly that the probabilities for players A and C to win are 0.55 rather than 0.5. A single stake bet spread across the 2 matches would then generate an expected profit of

$0.55\times(1/2 \times 1.9) + 0.55\times (1/2 \times 1.9) -1 = 0.045$

On the other hand, the expected profit from an accumulator bet on A-C is

$(0.55\times1.9) \times (0.55\times1.9) -1 = 0.092$

In other words, just as the bookmaker increases his expected profit through accumulator bets when he has an advantage per single bet, so does the gambler. So, bookmakers do indeed love accumulators, but not against smart gamblers.

In the next post we’ll find out how not knowing the difference between accumulator and standard bets cost one famous gambler a small fortune.

Actually, the situation is not quite as favourable for smart gamblers as the above calculation suggests. Suppose that the true probabilities for a win for A and C are 0.7 and 0.4, which still averages at 0.55. This situation would arise, for example, if the gambler was using a model which performed better than he realised for some matches, but worse than he realised for others.

The expected winnings from single bets remain at 0.045. But now, the expected winnings from an accumulator bet are just:

$(0.7\times1.9) \times (0.4\times1.9) -1 = 0.011,$

which is considerably lower. Moreover, with different numbers, the expected winnings from the accumulator bet could be negative, even though the expected winnings from separate bets is positive. (This would happen, for example, if the win probabilities for A and C were 0.8 and 0.3 respectively.)

So unless the smart gambler is genuinely smart on every bet, an accumulator bet may no longer be in his favour.
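
All of the expected-value calculations in this post can be reproduced with a short sketch, using the prices and probabilities from the examples above:

```python
PRICE = 1.9  # bookmaker price on each evenly-matched player

def singles_ev(p_a, p_c):
    """Expected profit from a unit stake split across single bets on A and C."""
    return 0.5 * p_a * PRICE + 0.5 * p_c * PRICE - 1

def acca_ev(p_a, p_c):
    """Expected profit from a unit-stake accumulator on both A and C winning."""
    return p_a * PRICE * p_c * PRICE - 1

singles_ev(0.55, 0.55)  # 0.045: value from two genuinely smart singles
acca_ev(0.55, 0.55)     # 0.092: the edge compounds in the accumulator
acca_ev(0.7, 0.4)       # 0.011: same average edge, far less acca value
acca_ev(0.8, 0.3)       # negative: the accumulator is now a losing bet
```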

# “Random”

You probably remember the NFL quarterback Colin Kaepernick, who started the protest against racism in the US by kneeling during the national anthem. In an earlier post I discussed how his statistics suggested he was being shunned by NFL teams due to his political stance. And in a joint triumph for decency and marketing, he subsequently became the face of Nike.

Since I now follow Kaepernick on Twitter, I recently received a tweet sent by Eric Reid of the Carolina Panthers. Reid was the first player to kneel alongside Kaepernick when playing for the San Francisco 49ers. But when his contract expired in March 2018, Reid also struggled to find a new club, despite his form suggesting he’d be an easy selection. Eventually, he joined the Carolina Panthers after the start of the 2018-19 season, and opened a dispute with the NFL, claiming that, like Kaepernick, he had been shunned by most teams as a consequence of his political actions.

This was his tweet:

The ‘7’ refers to the fact that Reid had been tested seven times since joining the Panthers in the standard NFL drug testing programme, and the “random” is intended ironically. That’s to say, Reid is implying that he’s being tested more often than is plausible if tests are being carried out randomly: in other words, he’s being victimised for the stand he’s taking against the NFL.

Reid is quoted as saying:

I’ve been here 11 weeks, I’ve been drug-tested seven times. That has to be statistically impossible. I’m not a mathematician, but there’s no way that’s random.

Well, let’s get one thing out of the way first of all: the only things that are statistically impossible are the things that are actually impossible. And since it’s possible that a randomised allocation of tests could lead to seven or more tests in 11 weeks, it’s certainly not impossible, statistically or otherwise.

However… Statistics is almost never about the possible versus the impossible; yes versus no; black versus white (if you’ll excuse the double entendre). Statistics is really about degrees of belief. Does the evidence suggest one version is more likely than another? And to what extent is that conclusion reliable?

Another small technicality… it seems that the first of Reid’s drug tests was actually a mandatory test that all players have to take when signing on for a new team. So actually, the question is whether the subsequent 6 tests in 11 weeks are unusually many if the tests are genuinely allocated randomly within the team roster.

On the face of it, this is a simple and standard statistical calculation. There are 72 players on a team roster and 10 players each week are selected for testing. So, under the assumption of random selection, the probability that any one player is tested any week is 10/72. Standard results then imply that the probability of a player being selected on exactly 6 out of 11 occasions – using the binomial distribution for those of you familiar with this stuff – is around 0.16%, while the probability of being tested 6 times or more is 0.17%. On this basis, there’s only a 17 in 10,000 chance that Reid would have been tested at least as often as he has been under a genuinely random procedure, and this would normally be considered small enough to provide evidence that the procedure is not random, and that Reid has been tested unduly often.
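
For those who’d like to check, here’s the calculation as a sketch in Python, using the binomial distribution with the roster numbers given above:

```python
from math import comb

# 10 of the 72-player roster are tested each week, so under random selection
# a given player's number of tests in 11 weeks is Binomial(11, 10/72).
n, p = 11, 10 / 72

def pmf(k):
    """Binomial probability of being selected in exactly k of the n weeks."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_exactly_6 = pmf(6)                                  # about 0.16%
p_at_least_6 = sum(pmf(k) for k in range(6, n + 1))   # about 0.17%
```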

However, we need to be a bit careful. Some time ago, in an offsite talk (mentioned here) I discussed the fact that 4 members of the quant team shared the same birthday, and showed that this was apparently an infinitesimally unlikely occurrence. But by considering the fact that it would have seemed surprising for any 4 individuals in the company to share the same birthday, and that there are many such potential combinations of 4 people, the event turned out not to be so very surprising after all.

And there’s a similar issue here… Reid is just one of 72 players on the roster. It happened to be Reid that was tested unusually often, but we’d have been equally surprised if any individual player had been tested at least 6 times in eleven weeks.  Is it surprising, though, that at least one of the 72 players gets tested this often? This is tricky to answer exactly, but can easily be done by simulation. Working this way I found the probability to be around 6.25%. Still unlikely, but not beyond the bounds of plausibility. A rule-of-thumb that’s often applied – and often inappropriately applied – is that if something has less than a 5% probability of occurring by chance, it’s safe to assume that there is something systematic and not random which led to the results; bigger than 5% and we conclude that the evidence isn’t strong enough to exclude the effect just being a random occurrence. So in this case, we couldn’t rule out the possibility that the test allocations are random.
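
A minimal sketch of such a simulation is below. The roster and testing numbers are those given above; the exact probability obtained is sensitive to modelling details, such as whether the mandatory signing-on test and the full eleven weeks are counted, so treat the output as illustrative.

```python
import random

def prob_some_player_tested(n_players=72, per_week=10, n_weeks=11,
                            threshold=6, reps=50_000, seed=1):
    """Monte Carlo estimate of the probability that at least one player on
    the roster is selected `threshold` or more times, when a random
    `per_week` players are drawn (without replacement) each week."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        counts = [0] * n_players
        for _ in range(n_weeks):
            for player in rng.sample(range(n_players), per_week):
                counts[player] += 1
        hits += max(counts) >= threshold
    return hits / reps
```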

So we have two different answers depending on how the data is interpreted. If we treat the data as specific to Eric Reid, then yes, there is strong evidence to suggest he’s been tested more often than is reasonable if testing is random. But if we consider him as just an arbitrary player in the roster, the evidence isn’t overwhelming that anyone in the roster as a whole has been overly tested.

Which should we go with? Well, each provides a different and valid interpretation of the available data. I would argue – though others might see it differently – that it’s entirely reasonable in this particular case to consider the data just with regard to Eric Reid, since there is a prima facie hypothesis specifically about him in respect of his grievance case against the NFL. In other words, we have a specific reason to be focusing on Reid, that isn’t driven by a dredge through the data.

On this basis, I’d argue that it is perfectly reasonable to question the extent to which the allocation of drugs tests in the NFL is genuinely “random”, and to conclude that there is reasonable evidence that Eric Reid is being unfairly targeted for testing, presumably for political reasons. The number of tests he has faced isn’t ‘statistically impossible’, but sufficiently improbable to give strong weight to this hypothesis.

# Lucky, lucky 2019

Welcome back to Smartodds loves Statistics.

Let’s start the new year with a fun statistic:

2019 is a lucky, lucky year.

Why is that? Well, let’s start with prime numbers. You’ll know that a prime number is a whole number greater than 1 that can’t be written as a product of two smaller whole numbers. For example 6 is not a prime number since $6 = 3 \times 2$, but 7 is a prime since it can only be factorised as $7 = 7 \times 1$.

One way of generating the prime numbers is as follows. Start with the whole numbers from 2 upwards:

2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30….

The first number remaining is 2, so remove all multiples of 2 that are bigger than 2:

2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29…..

The second number remaining is 3, so remove all multiples of 3 that are bigger than 3:

2, 3, 5, 7, 11, 13, 17, 19, 23, 25, 29…..

The third number remaining is 5, so remove all multiples of 5 that are bigger than 5:

2, 3, 5, 7, 11, 13, 17, 19, 23, 29…

And keep going this way. The numbers that remain are the prime numbers.

It’s easy to check that

2, 3, 5, 7, 11, 13, 17, 19, 23, 29

comprise all of the prime numbers that are smaller than 30. To get the bigger prime numbers you just have to apply more steps using the same procedure.
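The procedure just described is the classical sieve of Eratosthenes. Here’s a minimal Python sketch of it, step for step as in the text (the function name is my own):

```python
def primes_up_to(n):
    """Sieve of Eratosthenes, following the steps above: repeatedly
    take the smallest remaining number and remove its larger multiples."""
    remaining = list(range(2, n + 1))
    primes = []
    while remaining:
        p = remaining[0]  # the smallest number left must be prime
        primes.append(p)
        # remove all multiples of p (p itself included, since it's recorded)
        remaining = [m for m in remaining if m % p != 0]
    return primes

print(primes_up_to(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

This is the transparent version of the sieve rather than an efficient one, but it reproduces the sequence above exactly.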

Lucky numbers are generated in much the same way. This time we start with the sequence of all positive whole numbers:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,…..

The second number is 2, so we remove every second number from the sequence, leaving

1, 3, 5, 7, 9, 11, 13, 15, 17, 19, …..

The next number remaining is 3, so we remove every third number of the sequence

1, 3, 7, 9, 13, 15, 19, 21, …..

The next number remaining is 7, so we remove every 7th number.

1, 3, 7, 9, 13, 15, 21, 25, …..

And so on….

The numbers that remain in this procedure are said to be lucky numbers. And proceeding in this way, it’s easy to check that 2019 is a lucky number. But 2019 isn’t just ‘lucky’, it’s ‘lucky, lucky’. Every whole number can be written uniquely as a product of prime numbers. In the case of 2019 the unique prime factorisation is:

$2019 = 3 \times 673$

And… both 3 and 673 are also lucky numbers. So 2019 is doubly lucky in the sense that it is both lucky itself and all of its prime factors are lucky. Moreover, 2019 is the only year this century that has this property, so enjoy it while it lasts.
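If you’d like to verify the claim rather than sieve by hand, here’s a short Python sketch of the lucky-number sieve described above (the function name and implementation details are my own):

```python
def lucky_numbers(limit):
    """Standard lucky-number sieve: repeatedly take the next surviving
    number k and delete every k-th surviving number from the sequence."""
    seq = list(range(1, limit + 1, 2))  # first step: every 2nd number removed
    i = 1                               # seq[1] == 3 is the next sieving value
    while i < len(seq) and seq[i] <= len(seq):
        k = seq[i]
        # keep only survivors whose (1-based) position is not a multiple of k
        seq = [n for pos, n in enumerate(seq, start=1) if pos % k != 0]
        i += 1
    return seq

lucky = lucky_numbers(2020)
print(lucky[:8])                                # [1, 3, 7, 9, 13, 15, 21, 25]
print(2019 in lucky, 3 in lucky, 673 in lucky)  # True True True
```

So 2019 and both of its prime factors survive the sieve, confirming that 2019 is ‘lucky, lucky’.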

This post is really more about mathematics than statistics, but how about this? If I take a large number, 2020 say, and pick a number at random from 1 to 2020, what’s the probability that it will be a lucky number? One way to do this would be to identify all of the lucky numbers up to 2020. If there are m such numbers, then the probability a randomly selected number will be lucky is m/2020.  But it turns out there’s a good approximation that can be calculated very easily, and it works for any large number, not just 2020.

A classical result from number theory is that the probability that a randomly selected number in the sequence 1,2,…., N is a prime number, for any large value of N, is approximately

$1/\log(N)$

where log is the natural logarithm. With N=2020, this is equal to 0.13, so there’s roughly a 13% chance that a number from 1 to 2020 is a prime number. But, almost incredibly, the same approximation also works for lucky numbers, so there’s also roughly a 13% chance that a number from 1 to 2020 will be a lucky number. Obviously lucky, lucky numbers are much rarer, and I don’t know of any formula that can be used to calculate the probability of such numbers. The fact that there is just one lucky, lucky year this century, though, suggests the probability is pretty low.
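As a quick check of the approximation for primes (this comparison is mine, not from the post), we can set the 1/log(N) value against the exact count of primes up to N. The approximation is asymptotic, so at a size like N = 2020 it undershoots the true proportion somewhat:

```python
import math

def primes_up_to(n):
    # an ordinary sieve of Eratosthenes, used here just for the comparison
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            for m in range(p * p, n + 1, p):
                sieve[m] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]

N = 2020
approx = 1 / math.log(N)          # the 1/log(N) approximation
exact = len(primes_up_to(N)) / N  # true proportion of primes up to N
print(f"approx {approx:.3f}, exact {exact:.3f}")  # approx 0.131, exact 0.151
```

There are 306 primes up to 2020, so the exact proportion is about 15% against the approximate 13% — close, and the gap shrinks as N grows.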

# End-of-year feedback

Smartodds loves Statistics has been going for a few months now. I’ll take a break over Christmas, and come back early in the new year.

I have a quick favour to ask though. I’d be really grateful if you’d take a couple of minutes to send me some feedback on the blog. Has it been of interest to you? Have you read many of the posts? Are some types of posts more interesting to you than others? Any feedback at all – positive or negative – will help me to improve the blog in future, if indeed it has a future. Also, if there are things you’d like me to cover, which I haven’t already, please let me know.

You can either mail me at stuart.coles1111@gmail.com or, if you prefer to give feedback anonymously, use this survey form.

One thing I’m especially concerned about is that I received very few responses to recent posts which asked you to think about a problem and make a guess. Are these types of posts especially uninteresting? Are they too much of a distraction from your day-to-day work? Something else? Again, it would be helpful to know, just so that I can adapt the blog to what works best for everyone overall.

Thanks in advance and I hope you have a great Christmas and happy New Year. I’ll be back in January.

# And the winners are…

In previous posts I discussed the Royal Statistical Society’s ‘Statistic of the Year’ award. I’m now grateful to Richard.Greene@smartodds.co.uk for having pointed out that the winners for 2018 have now been announced. They are as follows:

International award: 90.5%

UK award: 27.8%

Before reading any further you might like to have a quick guess at where those statistics derive from and why they might have been selected.

Actually, the two statistics have contrasting motivations: one is pretty depressing, while the other is a cause for some optimism. Maybe this balance was intentional. The 90.5% is the proportion of all plastic waste ever produced that has never been recycled. The 27.8% is the peak percentage of UK electricity generated by solar power on 30 June this year, which made solar the largest single source of electricity production in the UK on that particular day.

You can find a fuller explanation of the awards here. This also includes a list of ‘highly commended’ nominations. I guess my favourite is 16.7%, the proportional reduction in the number of Jaffa Cakes in a McVitie’s Christmas tube due to shrinkflation: the process whereby manufacturers hide price increases by reducing the volume of a product – often by stealth – while keeping its price the same. Who knows why so many manufacturers should suddenly be adopting this strategy?

Meanwhile, remember to take note of any potential nominations for the 2019 awards.