Fringe benefits

The Edinburgh Fringe Festival is the largest arts festival in the world. The 2019 version has just finished, but Wikipedia lists some of the statistics for the 2018 edition:

  1. the festival lasted 25 days;
  2. it included more than 55,000 performances;
  3. that comprised 3548 different shows.

The shows themselves are of many different types, including theatre, dance, circus and music. But the largest section of the festival is comedy, and performers compete for the Edinburgh Comedy Award – formerly known as the Perrier Award – which is given to the best comedy show on the Fringe.

I mention all this because the TV channel Dave also publishes what it regards as the 10 best jokes of the festival. And number 4 this year was a statistical joke.


A cowboy asked me if I could help him round up 18 cows. I said, “Yes, of course. That’s 20 cows.”

Confession: the joke is really based on arithmetic rather than Statistics.

About a boy

You’re at a party and meet someone. After chatting for a bit, you work out that the girl you’re talking to has 2 kids and one of them is a boy. What are the chances she’s got 2 boys rather than a boy and a girl?

Actually, I really want to ask a slightly more complicated question than this. But let’s take things slowly. Please think about this problem and, if you have time, mail me or send me your answer via this form. Subsequently, I’ll discuss the answer to this problem and ask you the slightly more complicated question that I’m really interested in.

Terrible maps

One of the themes in this blog has been the creative use of diagrams to represent statistical data. When the data are collected geographically this amounts to using maps to represent data – perhaps using colours or shadings to show how a variable changes over a region, country or even the whole world.

With this in mind I recommend to you @TerribleMaps on Twitter.

It’s usually entertaining, and sometimes – though not always – scientific. Here are a few recent examples:

  1. Those of you with kids are probably lamenting right now the length of the summer holidays. But just look how much worse it could be if, for example, you were living in Italy (!):
  2. Just for fun… a map of the United States showing the most commonly used word in each state:
  3. A longitudinal slicing of the world by population size. It’s interesting because the population per slice will depend both on the number of countries that are included and on the population density in those slices.
  4. For each country in the following map, the flag shown is that of the country with which it shares the longest border. For example, the UK has its longest border with Ireland, and so is represented by the Ireland flag. Similarly, France’s flag is that of Brazil!
  5. This one probably only makes sense if you were born in, or have spent time living in, Italy:
  6. While this one will help you get clued-up on many important aspects of UK culture:
  7. And finally, this one will help you understand how ‘per capita’ calculations are made. You might notice there’s one country with an N/A entry. Try to identify which country that is and explain why its value is missing.

In summary, as you’ll see from these examples, the maps are usually fun, sometimes genuinely terrible, but sometimes contain a genuine pearl of statistical or geographical wisdom. If you have to follow someone on Twitter, there are worse choices you could make.



You looking at me?

Statistics: helping you solve life’s more difficult problems…

You might have read recently – since it was in every news outlet here, here, here, here, here, here, and here for example – that recent research has shown that staring at seagulls inhibits them from stealing your food. This article even shows a couple of videos of how the experiment was conducted. The researcher placed a package of food some metres in front of her in the vicinity of a seagull. In one experiment she watched the bird and timed how long it took before it snatched the food. She then repeated the experiment, with the same seagull, but this time facing away from the seagull. Finally, she repeated this exercise with a number of different seagulls in different locations.

At the heart of the study is a statistical analysis, and there are several points about both the analysis itself and the way it was reported that are interesting from a wider statistical perspective:

  1. The experiment is a good example of a designed paired experiment. Some seagulls are more likely to take food than others regardless of whether they are being looked at or not. The experiment aims to control for this effect by using pairs of results from each seagull: one in which the seagull was stared at, the other where it was not. Exploiting the knowledge that the data come in pairs considerably improves the accuracy of the analysis, making it much more likely that a possible effect is identified within the noisy data.
  2. To avoid the possibility that, for example, a seagull is more likely to take food quickly the second time, the order in which the pairs of experiments are applied is randomised for each seagull.
  3. Other factors are also controlled for in the analysis: the presence of other birds, the distance of the food, the presence of other people and so on.
  4. The original experiment involved 74 birds, but many were uncooperative and refused the food in one or other of the experiments. In the end the analysis is based on just 19 birds who took food both when being stared at and not. So even though results prove to be significant, it’s worth remembering that the sample on which results were based is very small.
  5. It used to be very difficult to verify the accuracy of a published statistical analysis. These days it’s almost standard for data and code to be published alongside the manuscript itself. This enables readers to both check the results and carry out their own alternative analyses. For this paper, which you can find in full here, the data and code are available here.
  6. If you look at the code it’s just a few lines of R. It’s notable that such a sophisticated analysis can be carried out with such simple code.
  7. At the risk of being pedantic, although most newspapers went with headlines like ‘Staring at seagulls is best way to stop them stealing your chips’, that’s not really an accurate summary of the research at all. Clearly, a much better way to stop seagulls eating your food is not to eat in the vicinity of seagulls. (Doh!) But even aside from this nit-picking point, the research didn’t show that staring at seagulls stopped them ‘stealing your chips’. It showed that, on average, the seagulls that bother to steal your chips do so more quickly when you are looking away. In other words, the headline should be:

If you insist on eating chips in the vicinity of seagulls, you’ll lose them quicker if you’re not looking at them

Guess that’s why I’m a statistician and not a journalist.

The issue of designed statistical experiments was something I also discussed in an earlier post. As I mentioned then, it’s an aspect of Statistics that, so far, hasn’t been much exploited in the context of sports modelling, where analyses tend to be based on historically collected data. But in the context of gambling, where different strategies for betting might be compared and contrasted, it’s likely to be a powerful approach. In that case, the issues of controlling for other variables – like the identity of the gambler or the stake size – and randomising to avoid biases will be equally important.
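To see why the pairing in the seagull experiment matters so much, here’s a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the approach times, the size of the staring effect and the amount of bird-to-bird variation are invented, and only the sample size of 19 birds comes from the study. It is also not the analysis used in the paper. The point is simply that when birds differ a lot from each other, a t statistic based on within-bird differences is much larger than one that ignores the pairing.

```python
import random
import statistics

def simulate_paired_experiment(n_birds=19, effect=5.0, bird_sd=20.0,
                               noise_sd=3.0, seed=1):
    """Simulate approach times (seconds) for each bird under both conditions.

    All numbers are invented for illustration; they are not the study's data.
    """
    rng = random.Random(seed)
    looking_away, staring = [], []
    for _ in range(n_birds):
        baseline = rng.gauss(40.0, bird_sd)  # bird-specific boldness
        looking_away.append(baseline + rng.gauss(0.0, noise_sd))
        staring.append(baseline + effect + rng.gauss(0.0, noise_sd))
    return looking_away, staring

def paired_t(x, y):
    """t statistic based on the within-bird differences."""
    diffs = [b - a for a, b in zip(x, y)]
    se = statistics.stdev(diffs) / len(diffs) ** 0.5
    return statistics.mean(diffs) / se

def unpaired_t(x, y):
    """Welch-style t statistic that ignores which bird each time came from."""
    se = (statistics.variance(x) / len(x) + statistics.variance(y) / len(y)) ** 0.5
    return (statistics.mean(y) - statistics.mean(x)) / se

away, stare = simulate_paired_experiment()
t_paired = paired_t(away, stare)
t_unpaired = unpaired_t(away, stare)
print(f"paired t = {t_paired:.2f}, unpaired t = {t_unpaired:.2f}")
```

Both statistics have the same numerator, the average difference in times. The paired version just divides by a much smaller standard error, because each bird acts as its own control.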


Zipf it

In a recent post I explained that in a large database containing the words from many English language texts of various types, the word ‘football’ occurred 25,271 times, making it the 1543rd most common word in the database. I also said that the word ‘baseball’ occurred 28,851 times, and asked you to guess what its rank would be.

With just this information available, it’s impossible to say with certainty what the exact rank will be. We know that ‘baseball’ is more frequent than ‘football’ and so it must have a higher rank (which means a rank with a lower number). But that simply means it could be anywhere from 1 to 1542.

However, we’d probably guess that ‘baseball’ is not so much more popular a word than ‘football’; certainly other words like ‘you’, ‘me’, ‘please’ and so on are likely to occur much more frequently. So, we might reasonably guess that the rank of ‘baseball’ is closer to the lower limit of 1542 than it is to the upper limit of 1. But where exactly should we place it?

Zipf’s law provides a possible answer.

In its simplest form Zipf’s law states that for many types of naturally occurring data – including frequencies of word counts – the second most common word occurs half as often as the most common; the third most common occurs a third as often as the most common; the fourth most common occurs a quarter as often; and so on. If we denote by f(r) the frequency of the item with rank r, this means that

f(r) = C/r,

or equivalently

r\times f(r)=C,

where C is the constant f(1). And since this is true for every choice of r, the frequencies and ranks of the words ranked r and s are related by

r\times f(r)=s \times f(s).

Then, assuming Zipf’s law applies,

rank(\mbox{`baseball'}) = rank(\mbox{`football'}) \times f(\mbox{`football'})/f(\mbox{`baseball'})

= 1543 \times 25271/28851 \approx 1352

So, how accurate is this estimate? The database I extracted the data from is the well-known Brown University Standard Corpus of Present-Day American English. The most common 5000 words in the database, together with their frequencies, can be found here. Searching down the list, you’ll find that the rank of ‘baseball’ is 1380, so the estimated value of 1352 is not that far out.
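The calculation is simple enough to do by hand, but here it is as a short Python sketch in case you want to try it with other words. The function name and its arguments are my own labels for illustration; the numbers are the ones quoted above.

```python
def zipf_rank_estimate(known_rank, known_freq, target_freq):
    """Estimate a rank from r * f(r) = s * f(s), i.e. Zipf's law with exponent 1."""
    return known_rank * known_freq / target_freq

# 'football' has rank 1543 and frequency 25,271; 'baseball' has frequency 28,851
estimate = zipf_rank_estimate(known_rank=1543, known_freq=25271, target_freq=28851)

true_rank = 1380  # rank of 'baseball' found in the word list
relative_error = abs(estimate - true_rank) / true_rank
print(round(estimate), f"{relative_error:.1%}")
```

The estimate rounds to 1352, which is within about 2% of the true rank of 1380.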

But where does Zipf’s law come from? It’s named after the linguist George Kingsley Zipf (1902-1950), who observed the law to hold empirically for words in different languages. Rather like Benford’s law, which we discussed in an earlier post, different arguments can be constructed that suggest Zipf’s law might be appropriate in certain contexts, but none is overwhelmingly convincing, and it’s really the body of empirical evidence that provides its strongest support.

Actually, Zipf’s law

f(r) = C/r,

is equivalent to saying that the frequency distribution follows a power law where the power is equal to -1. But many fits of the model to data can be improved by generalising this model to

f(r) = C/r^k,

for some constant k. In this more general form the law has been shown to work well in many different contexts, including sizes of cities, website access counts, gene expression frequencies and strength of volcanic eruptions. The version with k=1 is found to work well for many datasets based on frequencies of word counts, but other datasets often require different values of k. But to use this more general version of the law we’d have to know the value of k, which we could estimate if we had sufficient amounts of data. The simpler Zipf’s law has k=1 implicitly, and so we were able to estimate the rank of ‘baseball’ with just the limited amount of information provided.
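As a rough sketch of how k might be estimated when enough data are available: taking logs of f(r) = C/r^k gives log f(r) = log C - k log r, so a straight-line fit of log frequency against log rank has slope -k. The Python below does this by ordinary least squares. The data are synthetic, generated exactly from a power law with k = 1.2 purely to check that the fit recovers it, and the function name is my own. More careful estimation methods exist; this is just the simplest version of the idea.

```python
import math

def fit_power_law_exponent(ranks, freqs):
    """Least-squares slope of log f(r) against log r; returns the estimate of k."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return -sxy / sxx  # f(r) = C / r^k implies a slope of -k

# Synthetic check: frequencies generated exactly from f(r) = C / r^k with k = 1.2
ranks = list(range(1, 501))
freqs = [1_000_000 / r ** 1.2 for r in ranks]
k_hat = fit_power_law_exponent(ranks, freqs)
print(f"estimated k = {k_hat:.3f}")
```

With real word counts the points won’t lie exactly on a line, so the estimate of k would come with some uncertainty.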

Finally, I had just 3 responses to the request for predictions of the rank of ‘baseball’: 1200, 1300 and 1450, each of which is entirely plausible. But if I regard each of these estimates as those of an expert and try combining those expert opinions by taking the average I get 1317, which is very close to the Zipf’s law prediction of 1352. Maybe if I’d had more replies the average would have been even closer to the Zipf’s law estimate or indeed to the true answer itself 😏.



Many of you will know that my involvement with Smartodds stems from co-authorship of an academic paper with Mark Dixon. In this paper we developed a statistical model for calculating probabilities of football match results. Since then I’ve sometimes been asked – and indeed, was asked at a previous Smartodds offsite meeting – whether I regretted publishing that paper, rather than simply using its methodology to try to make money from bookmakers.

There are several answers to that question, including:

  1. Mark was really the principal author for that work, and so it was mostly his choice what we did with it;
  2. At the time I was genuinely more interested in the academic side of the work, rather than any potential it had for generating money;
  3. The model alone was, at best, only marginally profitable. Without additional knowledge from football experts, it was unlikely to make money;
  4. If we hadn’t published the paper, I’d probably have never ended up being connected to Smartodds.

Anyway, I recently thought about all this while following the Guardian coverage of England’s cricket World Cup semi-final, which mentioned that at that stage of the game – sometime in England’s innings after New Zealand had set a target of  – CricViz were giving England a 79% chance of winning. I’d never heard of CricViz, so I followed the links and discovered that it’s fundamentally an in-running cricket model that sits on your phone. You can get a complete description and links to download for Android or iOS here.

In terms of interface, CricViz is light years ahead of the work on football that I published with Mark Dixon. If you’d wanted to make predictions for football matches having read our original paper, you’d have had to collect data, program the model and run the predictions yourself. CricViz gives you live predictions for important matches both before the match starts, and over-by-over as the match progresses. It’s brilliant. And so, a similar question might be put to the authors of CricViz: why give this tool away for free, instead of using the methodology to fleece the bookmakers?

There are probably multiple answers to this question too, but one central issue is obviously the quality of the model on which CricViz is based. Though my paper with Mark Dixon didn’t make it easy for readers to calculate match numbers for themselves, it did provide both a complete mathematical recipe for what was needed and an analysis of historical results demonstrating its potential. CricViz does neither of those things. Its home website simply states…

WinViz is built upon CricViz’s proprietary model of T20 cricket. This model takes the career records of the players involved, historical data from the venue and country where the match is played, and the current match situation. The model then computes the probability of each result.

So, although you can launch WinViz on your phone to generate numbers live as a match progresses, the details of how those numbers are calculated are sketchy. Let’s make some guesses though…

A complicating feature of cricket is that there are different factors that contribute to the strength of a team’s position during a game, including:

  1. The number of runs the other team has already scored, where appropriate;
  2. The number of runs the batting team has scored so far in the innings;
  3. The number of wickets for the batting team that have already been lost;
  4. The number of balls remaining in the innings.

And all of this is before taking account of the actual strength of the two teams.

But we’ve discussed this issue in an earlier post – here – and it also got a mention here. In summary, a team’s remaining strength in an innings can be considered to be a function of the resources still available to them, as measured by balls and wickets. And in a landmark study, Duckworth and Lewis developed a formula which maps available resources into expected runs. Their objective was to develop a method that would provide a fair target for teams when matches were reduced by bad weather, leading to different numbers of balls received by each team. But the Duckworth-Lewis formula works equally well as a baseline method for in-running match predictions in matches without weather restrictions. And it’s likely that when the authors of CricViz say their model takes into account ‘current match situation’, this is precisely what they mean and how they do it.

The rest is more vague though. The career history of the players involved is taken into account, as is the history of previous matches in the same stadium and country. This suggests some kind of regression modelling that takes account of these aspects, but it’s not clear whether this applies to the Duckworth-Lewis adjustment itself or to the baseline deadball numbers to which the Duckworth-Lewis adjustment is applied.

For example: the deadball estimate for the number of runs scored in a complete innings by a particular team might be 300. After they have scored 100 runs it might be that Duckworth-Lewis calculations lead to the assessment that they have used 25% of their resources for that innings. In which case, they would be predicted to score a further 75% of 300 on top of the 100 they have already scored, for a total of 325. And the WinViz model might imply adjustments to the 300 or to the 75% or both, depending on the team composition and the history of matches in that particular stadium and country.
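The arithmetic in that example can be written as a tiny Python function. This is only my guess at the structure of the calculation, as discussed above, and not CricViz’s actual method; the function and argument names are invented for illustration.

```python
def projected_total(deadball_total, runs_so_far, resources_used):
    """Runs already scored plus the share of the deadball total still to come."""
    if not 0.0 <= resources_used <= 1.0:
        raise ValueError("resources_used must be a fraction between 0 and 1")
    return runs_so_far + (1.0 - resources_used) * deadball_total

# Deadball estimate of 300, with 100 runs scored using 25% of resources
print(projected_total(deadball_total=300, runs_so_far=100, resources_used=0.25))  # 325.0
```

Adjusting for team composition or venue would then amount to changing the inputs `deadball_total` or `resources_used`, or both.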

But how well does WinViz perform? It’s actually very difficult to tell, since – perhaps to avoid scrutiny – the CricViz app includes a history section of recent matches only. For example, when writing this blog post soon after the World Cup, all World Cup matches were available, but they’ve now been deleted. So, it’s not possible to do any kind of serious diagnostic analysis of model performance, though a ‘sanity-check’ can be done on any of the games currently available in the history.

For example, here’s the story of England’s World Cup final victory against New Zealand as told by WinViz at different stages of the match. Each of the figures is a screenshot of the CricViz iPhone app at the relevant stage in the match. The main features of each figure are predicted match outcome probabilities given current score and a graph showing the way the batting team’s score has increased throughout the innings so far, and how it’s predicted to increase over the rest of the innings.

  1. New Zealand are to bat first and are predicted to score 305. With 50% probability their score is expected to fall in the interval (261, 349). England are expected to beat New Zealand’s score with probability 68%. The tie has just a 2% predicted probability.

  2. After 15 overs, New Zealand have made a steady start in that they’ve only lost a single wicket. However, their scoring rate is quite low, so the England win probability has gone up slightly to 73%.

  3. After 25 overs, New Zealand have kept the run rate ticking over, and have lost just one further wicket. England’s win probability remains pretty much unchanged.

  4. After 30 overs New Zealand’s run rate has slowed a little and they’ve lost a further wicket. England’s win probability increases further to 81%.

  5. On 35 overs New Zealand are still scoring at a slowish rate and have lost a fourth wicket. England now have an 86% win probability.

  6. By the end of their innings, New Zealand have amassed 241 runs. This is way short of England’s expected run total, which leads to England maintaining an 86% win probability. (The following screenshot was taken during the final over when New Zealand had scored 240 runs.)

  7. England make a slow start in the first over, scoring just a single run. Their win probability drops very slightly.

  8. After 26 overs, England have had a mini-collapse, having scored just 94 runs – a lower figure than New Zealand made at the same point in their innings – while having lost 4 wickets. Their win probability drops dramatically to 48%.

  9. A mini-recovery. On 41 overs England have increased their score to 168 – similar to New Zealand’s at the same point in their innings – without further loss of wickets. England’s win probability jumps back up to 66%.

  10. After 49 overs, England are in trouble. With just one over left, England are 15 runs behind with 2 wickets remaining. The model gives England just an 18% probability of winning outright, though the tie also has a fairly high probability of 11%.

  11. The rest is history.

It’s obviously impossible to validate the precision of WinViz from a single game, but notwithstanding at least one bug in the graphics – England’s ‘to win’ target is incorrect throughout – the basic sanity check seems to be satisfactory for this match at least.

CricViz was developed by Nathan Leamon, who acts as a data analyst for the England team. An interesting article on his background and perspective on the use of data for supporting team development is available here. David Hastie, who used to work for our quant team, also played some part in the rollout of CricViz, and kindly provided me with additional background information to help with the writing of this post.

Off script


So, how did your team get on in the first round of Premier League fixtures for the 2019-20 season? My team, Sheffield United, were back in the top flight after a 13-year absence. It didn’t go too well though. Here’s the report:

EFL goal machine Billy Sharp’s long wait for a top-flight strike ends on the opening day. Ravel Morrison with the assist. But Bournemouth run out 4-1 winners.

And as if that’s not bad enough, we finished the season in bottom place:


Disappointing, but maybe not unexpected.

Arsenal also had a classic Arsenal season. Here’s the story of their run-in:

It seems only the Europa League can save them. They draw Man United. Arsenal abandon all hope and crash out 3-2. Just as they feared. Fans are more sad than angry. Once again they rally. Aubameyang and Alexandre Lacazette lead a demolition of high flying Liverpool. But they drop too many points and end up trophyless with another fifth-place finish.

Oh, Arsenal!

But what is this stuff? The Premier League doesn’t kick off for another week, yet here we have complete details of the entire season, match-by-match, right up to the final league table.

Welcome to The Script, produced by BT Sport. As they themselves explain:

Big data takes on the beautiful game.

And in slightly more detail…

BT has brought together the biggest brains in sports data, analysis and machine learning to write the world’s first artificial intelligence-driven script for a future premier league season.

Essentially, BT Sport have devised a model for match outcomes based on measures of team abilities in attack and defence. So far, so standard. After which…

We then simulate the random events that could occur during a season – such as injuries and player transfers – to give us even more accurate predictions.

But this is novel. How do you assign probabilities to player injuries or transfers? Are all players equally susceptible to injury? Do the terms of a player’s contract affect their chances of being sold? And who they are sold to? And what is the effect on a team’s performance of losing a player?

So, this level of modelling is difficult. But let’s just suppose for a minute you can do it. You have a model for what players will be available for a team in any of their fixtures. And you then have a model that, given the 2 sets of players that are available to teams for any fixture, spits out the probabilities of the various possible scores. Provided the model’s not too complicated, you can probably first simulate the respective lineups in a match, and then the scores given the team lineups. And that’s why Sheffield United lost 4-1 on the opening day to Bournemouth. And that’s why Arsenal did an Arsenal at the end of the season. And that’s why the league table ended up like it did above.

But is this a useful resource for predicting the Premier League?

Have a think about this before scrolling down. Imagine you’re a gambler, looking to bet on the outcome of the Premier League season. Perhaps betting on who the champions will be, or the top three, or who will be relegated, or whether Arsenal will finish fifth. Assuming BT’s model is reasonable, would you find the Script that they’ve provided helpful in deciding what bets to make?


Personally, I think the answer is ‘no’, not very helpful. What BT seem to have done is run A SINGLE SIMULATION of their model, for every game over the entire season, accumulating the simulated points of each team per match to calculate their final league position.


Imagine having a dice that you suspected of being biased, and you tried to understand its properties with a single roll. It’s almost pointless. Admittedly, with the Script, each team has 38 simulated matches, so the final league table is likely to be more representative of genuine team ability than the outcome of a single throw of a dice. But still, it’s the simulation of just a single season.

What would be much more useful would be to simulate many seasons and count, for example, in how many of those seasons Sheffield United were relegated. This way the model would be providing an estimate of the probability that Sheffield United gets relegated, and we could compare that against market prices to see if it’s a worthwhile bet.
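Here’s a minimal Python sketch of what simulating many seasons might look like. It is not BT’s model: the league is hypothetical, the team strengths are made up, and match scores come from a very simple independent Poisson model with a home advantage, all of which are my own assumptions for illustration. What matters is the outer loop, which repeats the whole season many times and counts how often each team finishes in the bottom three.

```python
import math
import random

def poisson(rng, mean):
    """Draw a Poisson variate using Knuth's multiplication method."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_season(strengths, rng, base_goals=1.3, home_boost=0.25):
    """Play a double round robin; return team names ordered best to worst."""
    points = {team: 0 for team in strengths}
    goal_diff = {team: 0 for team in strengths}
    for home in strengths:
        for away in strengths:
            if home == away:
                continue
            gap = strengths[home] - strengths[away]
            hg = poisson(rng, base_goals * math.exp(gap + home_boost))
            ag = poisson(rng, base_goals * math.exp(-gap))
            goal_diff[home] += hg - ag
            goal_diff[away] += ag - hg
            if hg > ag:
                points[home] += 3
            elif ag > hg:
                points[away] += 3
            else:
                points[home] += 1
                points[away] += 1
    # Sort by points, then goal difference, then a random tie-breaker
    return sorted(strengths,
                  key=lambda t: (points[t], goal_diff[t], rng.random()),
                  reverse=True)

def relegation_probabilities(strengths, n_seasons=200, n_relegated=3, seed=2019):
    """Estimate each team's relegation probability from repeated seasons."""
    rng = random.Random(seed)
    counts = {team: 0 for team in strengths}
    for _ in range(n_seasons):
        for team in simulate_season(strengths, rng)[-n_relegated:]:
            counts[team] += 1
    return {team: counts[team] / n_seasons for team in strengths}

# Hypothetical league: 20 teams with made-up strengths from +0.30 down to -0.27
strengths = {f"Team {i + 1}": 0.3 - 0.03 * i for i in range(20)}
probs = relegation_probabilities(strengths)
print(f"Team 20 relegated in {probs['Team 20']:.0%} of simulated seasons")
```

A single run of `simulate_season` is the equivalent of the Script. The dictionary returned by `relegation_probabilities` is the kind of output you could actually compare against market prices.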

In summary, we’ve seen in earlier posts (here and here, for example) contenders for the most pointless simulation in a sporting context, but the Script is lowering the bar to unforeseen levels. Despite this, if the blog is still going at the end of the season, I’ll do an assessment of how accurate the Script’s estimates turned out to be.


Data controversies


Some time ago I wrote about Mendel’s law of genetic inheritance, and how statistical analysis of Mendel’s data suggested his results were too good to be true. It’s not that his theory is wrong; it’s just that the data he provided as evidence for his theory seem to have been manipulated in such a way as to seem incontrovertible. Unfortunately the data lack the variation that Mendel’s own law would also imply should occur in measurements of that type, leading to the charge that the data had been manufactured or manipulated in some way.

Well, there’s a similar controversy about the picture at the top of this page.

The photograph, taken 100 years ago, was as striking at that time as the recent picture of a black hole, discussed in an earlier post, is today. However, this picture was taken with basic photographic equipment and a telescopic lens and shows a total solar eclipse, as the moon passes directly between the Earth and the Sun.

A full story of the controversy is given here.

In summary: Einstein’s theory of general relativity describes gravity not as a force between two attracting masses – as is central to Newtonian physics – but as a curvature caused in space-time due to the presence of massive objects. All objects cause such curvature, but only those that are especially massive, such as stars and planets, will have much of an effect.

Einstein’s relativity model was completely revolutionary compared to the prevailing view of physical laws at the time. But although it explained various astronomical observations that were anomalous according to Newtonian laws, it had never been used to predict anomalous behaviour. The picture above, and similar ones taken at around the same time, changed all that.

In essence, blocking out the sun’s rays enabled dimmer and more distant stars to be accurately photographed. Moreover, if Einstein’s theory were correct, the photographic position of these stars should be slightly distorted because of the spacetime curvature effects of the sun. But the effect is very slight, and even Newtonian physics suggests some disturbance due to gravitational effects.

In an attempt to get photographic evidence at the necessary resolution, the British astronomer Arthur Eddington set up two teams of scientists – one on the African island of Príncipe, the other in Sobral, Brazil – to take photographs of the solar eclipse on 29 May, 1919. Astronomical and photographic equipment was much more primitive in those days, so this was no mean feat.

Anyway, to cut a long story short, a combination of poor weather conditions and other setbacks meant that the results were less reliable than were hoped for. It seems that the data collected at Príncipe, where Eddington himself was stationed, were inconclusive, falling somewhere between the Newton and Einstein model predictions. The data at Sobral were taken with two different types of telescope, with one set favouring the Newton view and the other Einstein’s. Eddington essentially combined the Einstein-favouring data from Sobral together with those from Príncipe and concluded that the evidence supported Einstein’s relativistic model of the universe.

Now, in hindsight, with vast amounts of empirical evidence of many types, we know Einstein’s model to be fundamentally correct. But did Eddington selectively choose his data to support Einstein’s model?

There are different points of view, which hinge on Eddington’s motivation for dropping a subset of the Sobral data from his analysis. One point of view is that he wanted Einstein’s view to be correct, and therefore simply ignored the data that were less favourable. This argument is fuelled by political reasoning: it argues that since Eddington was a Quaker, and therefore a pacifist, he wanted to support a German theory as a kind of post-war reconciliation.

The alternative point of view, for which there is some documentary evidence, is that the Sobral data which Eddington ignored had been independently designated as unreliable. Therefore, on proper scientific grounds, Eddington had behaved entirely correctly by excluding it from his analysis, and his subsequent conclusions favouring the Einstein model were entirely consistent with the scientific data and information he had available.

This issue will probably never be fully resolved, though in a recent review of several books on the matter, theoretical physicist Peter Coles (no relation) claims to have reanalysed the data given in the Eddington paper using modern statistical methods, and found no reason to doubt his integrity. I have no reason to doubt that point of view, but there’s no detail of the statistical analysis that was carried out.

What’s interesting though, from a statistical point of view, is how the interpretation of the results depends on the reason for the exclusion of a subset of the Sobral data. If your view is that Eddington knew their contents and excluded them on that basis, then his conclusions in favour of Einstein must be regarded as biased. If you accept that Eddington excluded these data a priori because of their unreliability, then his conclusions were fair and accurate.
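The difference between the two kinds of exclusion can be made concrete with a small simulation, sketched here in Python. It is purely hypothetical and uses none of Eddington’s data. I assume three instruments that each measure the deflection with the same amount of noise, and I set the ‘true’ value to an arbitrary 1.30 arcseconds, below the relativistic prediction of 1.75, just so that any bias is visible. One analyst always drops an instrument chosen in advance; the other drops whichever measurement is furthest from the prediction they favour.

```python
import random
import statistics

EINSTEIN = 1.75    # predicted deflection at the sun's limb, in arcseconds
TRUE_VALUE = 1.30  # hypothetical "truth", chosen only to make the bias visible

def one_expedition(rng, noise_sd=0.3):
    """Three noisy measurements of the deflection, one per instrument."""
    return [rng.gauss(TRUE_VALUE, noise_sd) for _ in range(3)]

def a_priori_estimate(measurements):
    """Drop instrument 0, designated unreliable before looking at its value."""
    return statistics.mean(measurements[1:])

def post_hoc_estimate(measurements):
    """Drop whichever measurement lies furthest from the favoured prediction."""
    kept = sorted(measurements, key=lambda m: abs(m - EINSTEIN))[:2]
    return statistics.mean(kept)

rng = random.Random(1919)
expeditions = [one_expedition(rng) for _ in range(10_000)]
mean_a_priori = statistics.mean([a_priori_estimate(m) for m in expeditions])
mean_post_hoc = statistics.mean([post_hoc_estimate(m) for m in expeditions])
print(f"a priori: {mean_a_priori:.3f}, post hoc: {mean_post_hoc:.3f}")
```

Averaged over many imaginary expeditions, the a priori rule recovers the assumed true value, while the post hoc rule is pulled towards the favoured prediction, even though both analysts discard exactly one measurement each time.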

Data are often treated as a neutral aspect of an analysis. But as this story illustrates, the choice of which data to include or exclude, and the reasons for doing so, may be factors which fundamentally alter the direction an analysis will take, and the conclusions it will reach.




66,666 Random Numbers, Volume 1

A while ago I posted about gambling at roulette, and explained that whatever strategy you adopt – excluding the possibility of using electronic equipment to monitor the wheel and ball speeds, and improve prediction of where the ball will land – no strategy can overcome the edge that casinos offer by giving unfavourable odds on winning outcomes. Now, believe it or not, I do a fair bit of research to keep this blog ticking over. And in the course of doing the research for a potential casino/roulette post, I came across this book:

That’s right: 66,666 random numbers. But not just any numbers: the numbers on a roulette wheel, 0-36. The numbers are also colour coded as per a standard roulette wheel. Here’s a typical page:


But there’s more:

  1. The book includes a bonus set of an extra 10,000 random numbers. (Question: why not just call the book 76,666 random numbers?)
  2. There’s also an American Edition, which is almost identical, but accounts for the fact that in American casinos, the wheel also includes a 00.
  3. This is just Volume 1. Further volumes don’t seem to have gone into production yet, but the title suggests it’s just a matter of time.

Now, tables of random numbers have their place in history. As explained in an earlier post, simulation is a widely-used technique in statistical analysis, when exact mathematical calculations for statistical problems are too complicated. And before computers were widely available, it was commonplace to use tables of random digits as the basic ingredient for simulation routines.

But, hello! This is 2019. Chances are there’s a reasonable random number generator in the calculator on your phone. Or you can go here and fiddle around with the settings in the dialog box. Or you can fire off 66,666 random numbers with a single line of code in R or any other statistical language. You can even do it here:

    # simulate the results
    numbers <- sample(0:36, 66666, replace = TRUE)

    # tabulate the results
    table(numbers)

    # show results as a barplot
    library(ggplot2)
    df <- data.frame(table(numbers))
    colnames(df) <- c('number', 'frequency')
    ggplot(data = df, aes(x = number, y = frequency)) +
      geom_bar(stat = "identity", width = 0.5, fill = 'lightblue') +
      ggtitle('Frequencies of Results in 66,666 Roulette Spins')

Just hit the ‘run’ button. This may not work with all browsers, but seems to work ok with Chrome.

The simulation is all done in the first non-comment line. The rest is just some baggage to tabulate the frequencies and show them graphically.

This approach has the advantages that:

  1. You get different numbers every time you repeat the exercise, just like in real life;
  2. The numbers are stored electronically, so you can analyse them easily using any statistical functions. If you ran the R session above, you’ll have seen the results in tabulated summary form, as well as in a barplot, for example. But since the data are stored in the object ‘numbers’, you can do anything you like with them. For example, typing ‘mean(numbers)’ gives you the mean of the complete set of simulated spins.
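As a rough sketch of the kind of follow-up analysis this allows, the snippet below regenerates the object ‘numbers’ and runs a few summaries. The chi-squared test for equal frequencies is my own choice of illustration, and the name ‘counts’ is just a convenience.

```r
# Regenerate the simulated spins so this snippet stands alone
numbers <- sample(0:36, 66666, replace = TRUE)

# Mean of the simulated spins (the theoretical mean of 0, 1, ..., 36 is 18)
mean(numbers)

# Proportion of spins landing on zero (theoretical value 1/37)
mean(numbers == 0)

# Frequencies for all 37 pockets, keeping any pocket that never came up
counts <- table(factor(numbers, levels = 0:36))

# Chi-squared goodness-of-fit test against equal probabilities
chisq.test(counts)
```

Because the spins are genuinely simulated from a fair wheel, the test should usually show no evidence against equal frequencies, though the exact output will differ on every run.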

So, given that there are many easy ways you can generate random numbers, why would anybody possibly want to buy a book with 66,666 random numbers? Well, here’s the author to explain:

After gaining a moderate amount of experience playing roulette, I discovered how easy it was to master the rules of the game – and still lose!

He goes on…

Having lost my bankroll and now distrusting my knowledge of statistics as they pertained to roulette, I scoured the Internet for more information on the game. My findings only confirmed what I already knew: that statistics can only define the shape and character of data and events that have already taken place and have no real bearing over the outcome of future spins.

And finally…

I chose to compile a book of 66,666 random numbers for two reasons: One, I’ve paid my dues – literally, I’ve lost thousands of dollars playing this game, and I don’t want you to suffer the same consequence; two, as roulette is a game played against the house and not against my fellow gamblers, I knew I wanted to provide you with the same opportunity to study these numbers and learn something that might just make a difference in the way you play the game.

In summary, despite having lost a fortune believing there is some system to win at roulette, and despite sincerely wishing that you avoid the same fate, having learned through experience that no roulette system can possibly work, the author has provided you with 66,666 spins (plus a bonus 10,000 extra spins) of a roulette wheel so that you can study the numbers and devise your own system. (Which is bound to fail, and will almost certainly cost you a fortune if you try to implement it.)

Now, just to emphasise:

  1. The random properties of a roulette wheel are very simply understood from basic probability;
  2. A study of the outcome of randomly generated spins of a roulette wheel is a poor substitute for these mathematical properties;
  3. Biases in the manufacture or installation of a roulette wheel, which could make some numbers, or sequences of numbers, more frequent than others, are likely to be vanishingly small. But if there were such biases, you’d need to study a very long series of the outcomes of that particular wheel to be able to exploit them;
  4. You might choose to play roulette for fun. And you might even get lucky and win. But it is a game with negative expected winnings for the gambler, and if you play long enough you will lose with 100% certainty.
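The negative expectation in point 4 takes only a couple of lines to calculate. The sketch below assumes the standard payouts (35 to 1 for a bet on a single number, even money for a bet on red) and a stake of one unit; the variable names, and the choice of 17 as the number bet on, are just for illustration.

```r
# Expected net winnings per unit staked on a European wheel (37 pockets, 0-36)
p_single  <- 1 / 37
ev_single <- 35 * p_single - 1 * (1 - p_single)   # single number, assumed 35 to 1

p_red  <- 18 / 37
ev_red <- 1 * p_red - 1 * (1 - p_red)             # red, assumed even money

# American wheel (38 pockets, including 00), same assumed payout
ev_single_us <- 35 * (1 / 38) - 1 * (37 / 38)

c(european_single = ev_single, european_red = ev_red,
  american_single = ev_single_us)

# Simulation check: repeatedly bet one unit on number 17 on a European wheel
set.seed(2019)
spins    <- sample(0:36, 1e6, replace = TRUE)
winnings <- ifelse(spins == 17, 35, -1)
mean(winnings)
```

Both European bets work out to exactly -1/37 per unit staked (about -2.7%), and the American single-number bet to -2/38 (about -5.3%). The simulated average should land near -1/37, which is the point: the exact calculation already tells you everything the simulation can.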

However, we’ve seen a similar misuse of simulation before. In this post a newspaper did 100 random simulations of the NBA lottery draft in order to predict the lottery outcome. The only difference with the roulette simulation is that 66,666 is a rather bigger number – and therefore a greater waste of time – than 100.

Moral: simulation can often be avoided through a proper understanding of the randomness in whatever process you are studying. But if you really have to simulate, learn the basics of a language like R; don’t waste time and money on books of random tables.

Word rank

I recently came across a large database of word usage in American English. It aims to provide a representative sample of American English usage by including the words extracted from a large number of English texts of different types – books, newspaper articles, magazines etc. In total it includes around 560 million words collected over the years 1990-2017.

The word ‘football’ occurs in the database 25,271 times and has rank 1543. In principle, this means that ‘football’ was the 1543rd most frequent word in the database, though the method used for ranking the database elements is a little more complicated than that, since it attempts to combine a measure of both the number of times the word appears and the number of texts it appears in. Let’s leave that subtlety aside though and assume that ‘football’, with a frequency of 25,271, is the 1543rd most common word in the database.

The word ‘baseball’ occurs in the same database 28,851 times. With just this information, what would you predict the rank of the word ‘baseball’ to be? For example, if you think ‘baseball’ is the most common word, it would have rank 1. (It isn’t: ‘the’ is the most common word). If you think ‘baseball’ would be the 1000th most common word, your answer would be 1000.

Give it a little thought, but don’t waste time on it. I really just want to use the problem as an introduction to an issue that I’ll discuss in a future post. I’d be happy to receive your answer though, together with an explanation if you like, by mail. Or if you’d just like to fire an answer anonymously at me, without explanation, you can do so using this survey form.