Cathy O’Neil, an experienced data scientist and mathematics professor, illustrates the pitfalls of allowing data scientists to operate in a moral and ethical vacuum, including how the poor and disadvantaged are targeted for payday loans, high-cost insurance and political messaging on the basis of their zip codes and other harvested data.
So, WOMD shows how the data-based algorithms that increasingly form the fabric of our lives – from Google to Facebook to banks to shopping to politics – and the statistical methodology behind them are actually pushing societies in the direction of greater inequality and reduced democracy.
At the time WOMD was written these arguments were still in their infancy; but now that we are starting to live with the repercussions of the success of the campaign to remove Britain from the EU – which was largely driven by a highly professional exercise in data science – they seem much more relevant and urgent.
Anyway, Cathy O’Neil herself recently gave an interview to Bloomberg. Unfortunately, you now have to subscribe to read the whole article, so you won’t see much if you follow the link. But it was an interesting interview for various reasons. In particular, she discussed the trigger that led to her love of data and mathematics. She wrote that when she was nine her father showed her a mathematics puzzle, and that solving it led to a lifelong appreciation of the power of mathematical thinking. She wrote:
… I’ve never felt more empowered by anything since.
It’s more of a mathematical than a statistical puzzle, but maybe you’d like to think about it for yourself anyway…
Consider this diagram:
It’s a chessboard with 2 diagonally opposite corner squares removed. Now, suppose you had a set of 31 dominoes, with each domino being able to cover 2 horizontally or vertically adjacent squares. Your aim is to find a way of covering the 62 squares of the mutilated board with the 31 dominoes. If you’d like to try it, mail me with either a diagram or photo of your solution; or, if you think it can’t be done, mail me an explanation. I’ll discuss the solution in a future post.
But while simulation is a bit problematic – though immensely entertaining – in football and other sports, it has a totally different meaning in the context of Statistics, and proves to be an essential part of the statistician’s toolbox.
Here’s how it works: at its heart a statistical model describes a process in terms of probabilities. Since computers can be tricked into mimicking randomness, this means that in many circumstances they can be used to simulate the process of generating new ‘data’ from the statistical model. These data can then, in turn, be used to learn something about the behaviour of the model itself.
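To see what ‘tricked into mimicking randomness’ means in practice, here is a minimal sketch in R. The seed value 42 is an arbitrary choice for illustration: the computer’s ‘random’ numbers come from a deterministic generator, so resetting the seed reproduces exactly the same sequence of simulated tosses.

```r
# Pseudo-random numbers are generated deterministically from a seed.
# The seed value (42) is arbitrary and chosen only for illustration.
set.seed(42)
first <- sample(c("Heads", "Tails"), size = 10, replace = TRUE)

# Resetting the same seed replays exactly the same "random" tosses
set.seed(42)
second <- sample(c("Heads", "Tails"), size = 10, replace = TRUE)

print(first)
print(identical(first, second))
```

Without the calls to set.seed, each run would usually give a different sequence, which is the behaviour you want for most simulations.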
Let’s look at a simple example.
The standard statistical model for a sequence of coin tosses is that each toss of the coin is independent from all others, and that in each toss ‘heads’ or ‘tails’ will each occur with the same probability of 0.5. The code in the following R console will simulate the tossing of 100 coins, print and tabulate the results, and show a simple bar chart of the counts. Just press the ‘Run’ button to activate the code, then toggle between the windows to see the printed and graphical output. Since it’s a simulation you’re likely to get different results if you repeat the exercise. (Just like if you really tossed 100 coins, you’d probably get different results if you did it again.)
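#specify number of tosses
ntoss<-100
#do the simulation
tosses<-sample(c('Heads', 'Tails'), size=ntoss, replace=TRUE)
#show the simulated coins
print(tosses)
#show table of results
tab<-table(tosses)
print(tab)
#draw barplot of results
barplot(tab, col="lightblue", ylab="Count")
#helper behind runs(), used later in this post (the default p_head and the run-length step via rle() are assumptions)
runs<-function(ntoss, nrep, p_head=0.5){
  tosses<-sample(c('Heads','Tails'), size=nrep*ntoss, replace=TRUE, prob=c(p_head,1-p_head))
  #longest run of identical outcomes in each of the nrep sequences (one row per sequence)
  l<-apply(matrix(tosses, nrow=nrep), 1, function(x) max(rle(x)$lengths))
  hist(l, breaks=(min(l)-.5):(max(l)+.5), col="lightblue", main="Maximum Run Length", xlab="Length")
}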
That’s perhaps not very interesting, since it’s kind of obvious that we’d expect a near 50/50 split of heads and tails each time we repeat the experiment. But suppose instead we’re interested in runs of heads or tails, and in particular, the longest run of heads or tails in the sequence of 100 tosses. Some of you may remember we did something like this as an experiment at an offsite some years ago. This is sort of relevant to Smartodds since if we make a sequence of 50/50 bets, looking at the longest run of heads or tails is equivalent to looking at the longest run of winning or losing bets. Anyway, the mathematics to calculate the probability of (say) a run of 10 heads or tails occurring is not completely straightforward. But we can simulate the tossing of 100 coins many times and see how often we get a run of 10. And if we simulate often enough we can get a good estimate of the probability. So, let’s try tossing a coin 100 times, count the longest sequence of heads or tails, and repeat that exercise 10,000 times. I’ve already written the code for that. You just have to toggle to the R console window and write
runs(100, 10000)
followed by ‘return’. You should get a picture like this
Yours will be slightly different because your simulated tosses will be different from mine, but since we are both simulating many times (10,000 repetitions) the overall picture should be very similar. Anyway, on this basis, I got a run of 10 heads or tails around 400 times (though I could have tabulated the results to get the number more precisely). Since this was achieved in 10,000 simulations, it follows that the probability of a maximum sequence of 10 heads or tails is around 400/10000 = 0.04 or 4%.
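The tabulation mentioned above could look something like the following sketch. The helper longest_runs is a hypothetical name introduced here, not the runs function used in the console, and it assumes that the longest run is counted over heads and tails alike. The seed is arbitrary.

```r
# Hypothetical helper (not the post's runs() function): returns the longest
# run of identical outcomes in each of nrep simulated sequences of ntoss tosses.
longest_runs <- function(ntoss, nrep, p_head = 0.5) {
  tosses <- matrix(
    sample(c("Heads", "Tails"), size = nrep * ntoss, replace = TRUE,
           prob = c(p_head, 1 - p_head)),
    nrow = nrep
  )
  # One row per repetition; rle() gives the lengths of consecutive runs
  apply(tosses, 1, function(x) max(rle(x)$lengths))
}

set.seed(1)  # arbitrary seed, only so the run is repeatable
l <- longest_runs(100, 10000)

# Count how often each maximum run length occurred
tab <- table(l)
print(tab)

# Estimated probability that the longest run is exactly 10
p10 <- mean(l == 10)
print(p10)
```

Reading the count for 10 from the table, or using the final proportion directly, avoids having to judge the height of a bar by eye.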
This illustrates exactly the procedure we adopt for making predictions from some of our models. It’s less relevant for deadball models, from which it’s usually easy to get predictions by a simple formula; but our in-running models often require us to simulate the goals (or points) in a game, and to repeat the game many times, in order to get the probability of a certain number of goals/points.
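As a purely illustrative stand-in for that procedure, and not a description of any actual Smartodds model, the sketch below assumes that the remaining goals for each team follow independent Poisson distributions. The rates 1.5 and 1.1 and the variable names are all made up for the example.

```r
# Illustrative assumptions only: independent Poisson goals with made-up rates.
home_rate <- 1.5
away_rate <- 1.1
nsim <- 10000

set.seed(3)  # arbitrary seed, only so the run is repeatable

# Simulate the game nsim times
home_goals <- rpois(nsim, home_rate)
away_goals <- rpois(nsim, away_rate)
total_goals <- home_goals + away_goals

# Estimated probabilities are just proportions of simulated games
p_over_2_5 <- mean(total_goals > 2.5)
p_home_win <- mean(home_goals > away_goals)
print(p_over_2_5)
print(p_home_win)

# Estimated distribution of the total number of goals
print(prop.table(table(total_goals)))
```

This toy model is simple enough that a formula would do the job, so simulation isn’t needed here; the point is only the recipe of simulating the game many times and counting how often each outcome occurs.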
You can increase the accuracy of the calculation by increasing the number of repetitions. This can be prohibitive if the simulations are slow, and a compromise usually has to be accepted between speed and accuracy. Try increasing (or decreasing) the number of repetitions in the above example: what effect does it have on the shape of the graph?
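The trade-off can be made concrete with a rough sketch. It estimates the probability of a longest run of exactly 10 using 100, 1,000 and 10,000 repetitions, and attaches the usual binomial standard error sqrt(p(1-p)/n) to each estimate. The helper names longest_runs and estimate_p10 are hypothetical, and the seed is arbitrary.

```r
# Hypothetical helper: longest run of identical outcomes in each repetition
longest_runs <- function(ntoss, nrep, p_head = 0.5) {
  tosses <- matrix(
    sample(c("Heads", "Tails"), size = nrep * ntoss, replace = TRUE,
           prob = c(p_head, 1 - p_head)),
    nrow = nrep
  )
  apply(tosses, 1, function(x) max(rle(x)$lengths))
}

# Estimate P(longest run == 10) and its approximate standard error
estimate_p10 <- function(nrep) {
  l <- longest_runs(100, nrep)
  p_hat <- mean(l == 10)
  c(nrep = nrep, estimate = p_hat,
    std_error = sqrt(p_hat * (1 - p_hat) / nrep))
}

set.seed(2)  # arbitrary seed, only so the run is repeatable
results <- t(sapply(c(100, 1000, 10000), estimate_p10))
print(results)
```

The standard error shrinks roughly in proportion to one over the square root of the number of repetitions, so each extra decimal place of accuracy costs about 100 times as many simulations.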
The function runs is actually slightly more general than the above example illustrates. If, for example, you write runs(100, 10000, 0.6), this repeats the above experiment but where the probability of getting a head on any toss of the coin is 0.6. This isn’t too realistic for coin tossing, but would be relevant for sequences of bets, each of which has a 0.6 chance of winning. How do you think the graph will change in this case? Try it and see.
The calculation of the probability of the longest runs in sequences of coin tosses can actually be done analytically, so the use of simulation here is strictly unnecessary. This isn’t the case for many models – including our in-running model. In such cases simulation is the only viable method of calculating predictions.
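For the curious, here is one way an exact calculation could go. It is a sketch of a standard dynamic-programming argument, not necessarily the derivation intended above, and the function name prob_longest_run_at_most is hypothetical. It tracks, toss by toss, the probability that no run has yet exceeded m and that the sequence currently ends in a run of exactly k heads (or k tails).

```r
# Hypothetical helper: exact P(longest run of heads or tails <= m) in ntoss tosses.
prob_longest_run_at_most <- function(m, ntoss, p_head = 0.5) {
  if (m < 1) return(0)
  if (m >= ntoss) return(1)
  # heads[k]: probability that no run so far exceeds m and the sequence
  # currently ends in a run of exactly k heads (similarly for tails)
  heads <- c(p_head, rep(0, m - 1))
  tails <- c(1 - p_head, rep(0, m - 1))
  for (i in seq_len(ntoss - 1)) {
    # A head either starts a new run (after any tails run) or extends a
    # heads run of length k < m; extending a run of length m is not allowed
    new_heads <- c(sum(tails), heads[-m]) * p_head
    new_tails <- c(sum(heads), tails[-m]) * (1 - p_head)
    heads <- new_heads
    tails <- new_tails
  }
  sum(heads) + sum(tails)
}

# P(longest run is exactly 10) = P(at most 10) - P(at most 9)
p10_exact <- prob_longest_run_at_most(10, 100) - prob_longest_run_at_most(9, 100)
print(p10_exact)
```

The result should be close to the figure of roughly 4% obtained by simulation, and passing p_head = 0.6 gives the corresponding exact value for the biased case.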
Simulation has important statistical applications other than calculating predictions. Future posts might touch on some of these.
If you had any difficulty using the R console in this post – either because my explanations weren’t very good, or because the technology failed – please mail to let me know. As I explained at our recent offsite, I’ve set up a page which explains further the use of the R consoles in this blog, and provides an empty console that you can experiment with. But please do write to me if anything’s unclear or if you’d like extra help or explanations.
This is the first in a (possibly short) series of posts giving biographical details of famous statisticians from history. There are many possibilities here, but I’ll generally limit posts to statisticians who’ve actually done something really interesting, especially outside of the traditional world of statistics.
With that in mind, the first entry in the series is John Tukey. Tukey was an American statistician who was born in 1915 and died in 2000. He was really at the forefront of modern statistics, moving the subject on from very traditional topics like hypothesis testing and decision theory, to a more comprehensive, exploratory version of the discipline, relying heavily on techniques that were only made possible by 20th century advances in computing. You may have heard of a boxplot, as a method for summarising data measured on a single variable, or more usefully for comparing data on several variables. Well, the boxplot was invented – along with many other useful techniques in the statistician’s toolbox – by Tukey.
Outside of a strictly statistical context, Tukey coined the computing term ‘bit’ (binary digit). He’s also credited with having been the first person to use the term ‘software’ to describe a computer program, though the same term had previously been used to describe personnel.
But my reason for introducing Tukey as the first in this series is that some of my favourite quotes in statistics are his. A fairly comprehensive list can be found here. I like this one for example:
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
Clients please take note. But my absolute favourite Tukey quote, and my reason for making him the first in this series of famous statisticians, is this quote:
The best thing about being a statistician is that you get to play in everyone’s backyard.
What this means is that statisticians are needed everywhere – by scientists, in government, by doctors, in industry, in finance and indeed by gamblers – and with a standard toolkit of statistical techniques statisticians get to work in many different fields; in other words, in everyone’s backyard. The corollary to this is that techniques developed for use in one backyard are often just as useful (and sometimes even more useful) in a completely different backyard. In particular, none of the techniques we use for our sports models were originally developed with sports in mind: all of them came from someone else’s backyard.
So, I make no apology for the diverse nature of this blog. The common thread is statistics, but I’ll aim to cover as many backyards as possible.