Pulp Fiction (Our Esteemed Leader’s cut)

The previous post had a cinematic theme. That got me remembering an offsite a while back where Matthew.Benham@smartbapps.co.uk gave a talk that I think he called ‘Do the Right Thing’, which is the title of a 1989 Spike Lee film. Midway through his talk Matthew gave a premiere screening of his own version of a scene from Pulp Fiction. Unfortunately, I’ve been unable to get hold of a copy of Matthew’s cut, so we’ll just have to make do with the inferior original….

The theme of Matthew’s talk was the importance of always acting in relation to best knowledge, even if it contradicts previous actions taken when different information was available. So, given the knowledge and information you had at the start of a game, you might have bet on team A. But if the game evolves in such a way that a bet on team B becomes positive value, you should do that. Always do the right thing. And the point of the scene from Pulp Fiction? Don’t let pride get in the way of that principle.  

These issues will make a great topic for this blog sometime. But this post is about something else…

Dependence is a big issue in Statistics, and we're likely to return to it in different ways in future posts. Loosely speaking, two events are said to be independent if knowing the outcome of one doesn't affect the probabilities of the outcomes of the other. For example, it's usually reasonable to treat the outcomes of two different football matches taking place on the same day as independent: if we know one match finished 3-0, that information is unlikely to affect any judgements we might have about the possible outcomes of a later match. Events that are not independent are said to be dependent: in this case, knowing the outcome of one will change the probabilities of the outcomes of the other. In tennis, for example, the outcome of one set tends to affect the chances of who will win a subsequent set, so set winners are dependent events.

With this in mind, let's follow up the discussion in the previous two posts (here and here) about accumulator bets. By multiplying prices from separate bets together, bookmakers are assuming that the events are independent. But if there is dependence between the events, it's possible that an accumulator offers a value bet even if the individual bets are each of negative value. This might be part of the reason why Mark Kermode has been successful in several accumulator bets over the years (or would have been if he'd taken his predictions to the bookmaker and actually placed an accumulator bet).

Let me illustrate this with some entirely made-up numbers. Let's suppose 'Pulp Fiction (Our Esteemed Leader's cut)' is up for a best movie award, and its upstart director, Matthew Benham, has also been nominated for best director. The numbers for single bets on PF and MB are given in the following table. We'll suppose the bookmakers are accurate in their evaluation of the probabilities, and that they guarantee themselves an expected profit by offering prices that are below the fair prices (see the earlier post).

                      True Probability   Fair Price   Bookmaker Price
Best Movie: PF             0.4              2.5              2
Best Director: MB          0.25             4                3.5

 

Because the available prices are lower than the fair prices and the probabilities are correct, both individual bets have negative value (-0.2 and -0.125 respectively for a unit stake). The overall price for a PF/MB accumulator bet is 7, which assuming independence is an even poorer value bet, since the expected winnings from a unit stake are

0.4 × 0.25 × 7 − 1 = −0.3

However, suppose voters for the awards tend to have similar preferences across categories, so that if they like a particular movie, there’s an increased chance they’ll also like the director of that movie. In that case, although the table above might be correct, the probability of MB winning the director award if PF (MB cut) is the movie winner is likely to be greater than 0.25. For argument’s sake, let’s suppose it’s 0.5. Then, the expected winnings from a unit stake accumulator bet become

0.4 × 0.5 × 7 − 1 = 0.4

That’s to say, although the individual bets are still both negative value, the accumulator bet is extremely good value. This situation arises because of the implicit assumption of independence in the calculation of accumulator prices. When the assumption is wrong, the true expected winnings will be different from those implied by the bookmaker prices, potentially generating a positive value bet.
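If you want to play with these numbers yourself, here's a minimal sketch of the calculation in Python. All the figures are the made-up ones from the table above, and the conditional probability of 0.5 is the value assumed for argument's sake.

```python
# Accumulator value with and without dependence, using the made-up numbers above.

p_pf = 0.4            # probability Pulp Fiction wins best movie
p_mb = 0.25           # marginal probability MB wins best director
p_mb_given_pf = 0.5   # assumed probability MB wins given that PF has won
acc_price = 2 * 3.5   # bookmaker accumulator price = product of the single prices

# Expected winnings per unit stake on the accumulator
ev_independent = p_pf * p_mb * acc_price - 1          # what the bookmaker prices assume
ev_dependent = p_pf * p_mb_given_pf * acc_price - 1   # allowing for the dependence

print(f"Expected winnings assuming independence: {ev_independent:+.2f}")  # -0.30
print(f"Expected winnings allowing dependence:   {ev_dependent:+.2f}")    # +0.40
```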

Obviously with most accumulator bets – like multiple football results – independence is more realistic, and this discussion is less relevant. But for speciality bets like the Oscars, or perhaps some political bets where late swings in votes are likely to affect more than one region, there may be considerable value in accumulator bets if available.


If anyone has a copy of Our Esteemed Leader’s cut of the Pulp Fiction scene on a pen-drive somewhere, and would kindly pass it to me, I will happily update this post to include it. 

“Random”

You probably remember Colin Kaepernick, the NFL quarterback who started the protest against racism in the US by kneeling during the national anthem. In an earlier post I discussed how his statistics suggested he was being shunned by NFL teams due to his political stance. And in a joint triumph for decency and marketing, he subsequently became the face of Nike.

Since I now follow Kaepernick on Twitter, I recently received a tweet sent by Eric Reid of the Carolina Panthers. Reid was the first player to kneel alongside Kaepernick when playing for the San Francisco 49ers. But when his contract expired in March 2018, Reid also struggled to find a new club, despite form that suggested he'd be an easy selection. Eventually, he joined the Carolina Panthers after the start of the 2018-19 season, and opened a dispute with the NFL, claiming that, like Kaepernick, he had been shunned by most teams as a consequence of his political actions.

This was his tweet:

The ‘7’ refers to the fact that Reid had been tested seven times since joining the Panthers in the standard NFL drug testing programme, and the “random” is intended ironically. That's to say, Reid is implying that he's being tested more often than is plausible if tests are being carried out randomly: in other words, that he's being victimised for the stand he's taking against the NFL.

Reid is quoted as saying:

I’ve been here 11 weeks, I’ve been drug-tested seven times. That has to be statistically impossible. I’m not a mathematician, but there’s no way that’s random.

Well, let’s get one thing out of the way first of all: the only things that are statistically impossible are the things that are actually impossible. And since it’s possible that a randomised allocation of tests could lead to seven or more tests in 11 weeks, it’s certainly not impossible, statistically or otherwise. 

However… Statistics is almost never about the possible versus the impossible; yes versus no; black versus white (if you’ll excuse the double entendre). Statistics is really about degrees of belief. Does the evidence suggest one version is more likely than another? And to what extent is that conclusion reliable?

Another small technicality… it seems that the first of Reid’s drug tests was actually a mandatory test that all players have to take when signing on for a new team. So actually, the question is whether the subsequent 6 tests in 11 weeks are unusually many if the tests are genuinely allocated randomly within the team roster.

On the face of it, this is a simple and standard statistical calculation. There are 72 players on a team roster and 10 players each week are selected for testing. So, under the assumption of random selection, the probability that any one player is tested any week is 10/72. Standard results then imply that the probability of a player being selected on exactly 6 out of 11 occasions – using the binomial distribution for those of you familiar with this stuff – is around 0.16%, while the probability of being tested 6 times or more is 0.17%. On this basis, there’s only a 17 in 10,000 chance that Reid would have been tested at least as often as he has been under a genuinely random procedure, and this would normally be considered small enough to provide evidence that the procedure is not random, and that Reid has been tested unduly often.  
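For anyone who'd like to check the arithmetic, here's a minimal sketch of the calculation in Python using scipy's binomial distribution. The 72-player roster, 10 selections per week and 11 weeks are the figures quoted above.

```python
from scipy.stats import binom

n_weeks = 11       # weekly testing rounds since Reid joined
p_week = 10 / 72   # chance a given player is picked in any one week under random selection

# Probability of being picked exactly 6 times, and 6 or more times, in 11 weeks
p_exactly_6 = binom.pmf(6, n_weeks, p_week)
p_at_least_6 = binom.sf(5, n_weeks, p_week)   # P(X >= 6) = 1 - P(X <= 5)

print(f"P(exactly 6 tests):  {p_exactly_6:.4%}")   # about 0.16%
print(f"P(6 or more tests):  {p_at_least_6:.4%}")  # a little under 0.2%
```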

 

However, we need to be a bit careful. Some time ago, in an offsite talk (mentioned here) I discussed the fact that 4 members of the quant team shared the same birthday, and showed that this was apparently an infinitesimally unlikely occurrence. But by considering the fact that it would have seemed surprising for any 4 individuals in the company to share the same birthday, and that there are many such potential combinations of 4 people, the event turned out not to be so very surprising after all.

And there’s a similar issue here… Reid is just one of 72 players on the roster. It happened to be Reid that was tested unusually often, but we’d have been equally surprised if any individual player had been tested at least 6 times in eleven weeks.  Is it surprising, though, that at least one of the 72 players gets tested this often? This is tricky to answer exactly, but can easily be done by simulation. Working this way I found the probability to be around 6.25%. Still unlikely, but not beyond the bounds of plausibility. A rule-of-thumb that’s often applied – and often inappropriately applied – is that if something has less than a 5% probability of occurring by chance, it’s safe to assume that there is something systematic and not random which led to the results; bigger than 5% and we conclude that the evidence isn’t strong enough to exclude the effect just being a random occurrence. So in this case, we couldn’t rule out the possibility that the test allocations are random.
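For completeness, here's a rough sketch of how a simulation along these lines might be set up in Python. The details matter – for instance how the mandatory signing-on test is treated and how many weekly selection rounds are simulated – so a sketch like this won't necessarily reproduce the 6.25% figure exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

N_PLAYERS = 72     # roster size
N_TESTED = 10      # players selected for testing each week
N_WEEKS = 11       # weekly selection rounds simulated (an assumption; see note above)
THRESHOLD = 6      # "tested at least 6 times"
N_SIMS = 100_000   # number of simulated seasons

count = 0
for _ in range(N_SIMS):
    tests = np.zeros(N_PLAYERS, dtype=int)
    for _ in range(N_WEEKS):
        chosen = rng.choice(N_PLAYERS, size=N_TESTED, replace=False)
        tests[chosen] += 1
    if tests.max() >= THRESHOLD:   # did ANY player reach the threshold this season?
        count += 1

print(f"P(at least one of {N_PLAYERS} players tested {THRESHOLD}+ times): {count / N_SIMS:.3f}")
```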

So we have two different answers depending on how the data is interpreted. If we treat the data as specific to Eric Reid, then yes, there is strong evidence to suggest he's been tested more often than is reasonable if testing is random. But if we consider him as just an arbitrary player in the roster, the evidence isn't overwhelming that anyone in the roster as a whole has been overly tested.

Which should we go with? Well, each provides a different and valid interpretation of the available data. I would argue – though others might see it differently – that it's entirely reasonable in this particular case to consider the data just with regard to Eric Reid, since there is a prima facie hypothesis specifically about him in respect of his grievance case against the NFL. In other words, we have a specific reason to be focusing on Reid that isn't driven by a dredge through the data.

On this basis, I’d argue that it is perfectly reasonable to question the extent to which the allocation of drugs tests in the NFL is genuinely “random”, and to conclude that there is reasonable evidence that Eric Reid is being unfairly targeted for testing, presumably for political reasons. The number of tests he has faced isn’t ‘statistically impossible’, but sufficiently improbable to give strong weight to this hypothesis. 

 

 

 

Worst use of Statistics of the year

You might remember in a couple of earlier posts (here and here) I discussed the Royal Statistical Society’s ‘Statistic of the Year’ competition. I don’t have updates on the results of that competition for 2018 yet, but in the meantime I thought I’d do my own version, but with a twist: the worst use of Statistics in 2018.

To be honest,  I only just had the idea to do this, so I haven’t been building up a catalogue of options throughout the year. Rather, I just came across an automatic winner in my twitter feed this week.

So, before announcing the winner, let’s take a look at the following graph:

This graph is produced by the Office for National Statistics, which is the UK government’s own statistical agency, and shows the change in average weekly wages in the UK, after allowance for inflation effects, for the period 2008-2018. 

There are several salient points that one might draw from this graph:

  1. Following the financial crash in 2008, wages declined steadily over a 6-year period to 2014, when they bottomed out at around 10% lower than pre-crash levels.
  2. The election of a Conservative/Lib Dem coalition government in 2010 didn’t have any immediate impact on the decline of wage levels. Arguably the policy of intense austerity may simply have exacerbated the problem.
  3. Things started to pick up during 2014, most likely due to the effects of Quantitative Easing and other efforts to stimulate the economy by the Bank of England in the period after the crash.
  4. Something sudden happened in 2016 which seems to have choked off the recovery in wage levels. (If only there was a simple explanation for what that might be.)
  5. Wages are currently at the same level as they were 7 years ago in 2011, and significantly lower than they were immediately following the financial crash in 2008.

So that’s my take on things. Possibly there are different interpretations that are equally valid and plausible. I struggle, however, to accept the following interpretation, to which I am awarding the 2018 worst use of Statistics award:

 ONS data showing real wages rising at fastest rate in 10 years… is good news for working Britain

Now, believe me, I've looked very hard at the graph to try to find a way in which this statement provides a reasonable interpretation of it, but I simply can't. You might argue that wages grew at the fastest rate in a decade during 2015, but even then only because they had performed so miserably in the preceding years. Any reasonable interpretation of the graph suggests wages have flatlined since 2016, and it's simply misleading to suggest that they are currently rising at the fastest rate in 10 years.

So, my 2018 award for the worst use of Statistics goes to…

… Dominic Raab, who until his recent resignation was the Secretary of State responsible for the United Kingdom’s withdrawal from the European Union (i.e. Brexit) and is a leading contender to replace Theresa May as the next leader of the Conservative Party.

Well done, Dominic. Whether due to mendacity or ignorance, you are a truly worthy winner.

Statistics by pictures

Generally speaking there are three main phases to any statistical analysis:

  1. Design;
  2. Execution;
  3. Presentation.

Graphical techniques play an important part in both the second and third phases, but the emphasis is different in each. In the second phase the aim is usually exploratory, using graphical representations of data summaries to hunt for structure and relationships that might subsequently be exploited in a formal statistical model. The graphs here tend to be quick but rough, and are intended more for the statistician than the client.

In the presentation phase the emphasis is a bit different, since the analysis has already been completed, usually involving some sort of statistical model and inference. In this case diagrams are used to highlight the results to clients or a wider audience in a way that illustrates most effectively the salient features of the analysis. Very often the strength of message from a statistical analysis is much more striking when presented graphically rather than in the form of numbers. Moreover, some statisticians have also developed the procedure into something of an art form, using graphical techniques not just to convey the results of the analysis, but also to put them back in the context from where the data derive.

One of my favourite exponents of this technique is Mona Chalabi, who has regular columns in the Guardian, among other places.

Here are a few of her examples:

Most Popular Dog Names in New York

mona_2

A Complete History of the Legislation of Same-Sex Marriage 

mona4

The Most Pirated Christmas Movies

mona_1

And last and almost certainly least…

Untitled

mona5

Tell you what though… that looks a bit more than 16% to me, suggesting a rather excessive use of artistic licence in this particular case.

Mr. Wrong

 

As a footnote to last week's post 'How to be wrong', I mentioned that Daniel Kahneman had been shown to be wrong, having relied on unreliable research in his book 'Thinking, Fast and Slow'. I also suggested that he had tried to deflect blame for this oversight, essentially putting all of the blame on the authors of the work he cited.

I was wrong.

Luigi.Colombo@smartodds.co.uk pointed me to a post by Kahneman in the comments section of the blog post I referred to, in which Kahneman clearly takes responsibility for the unreliable interpretations he included in his book and explains in some detail why they were made. In other words, he's being entirely consistent with the handy guide for being wrong that I included in my original post.

Apologies.


But while we’re here, let me just explain in slightly more detail what the issue was with Kahneman’s analysis…

As I’ve mentioned in other settings, if we get a result based on a very small sample size, then that result has to be considered not very reliable. But if you get similar results from several different studies, all based on small sample sizes, then the combined strength of evidence is increased. There are formal ways of combining results in this way, and it often goes under the name of ‘meta-analysis‘. This is a very important technique, especially as time and money constraints often mean the sample sizes in individual studies are small, and Kahneman used this approach – at least informally – to combine the strength of evidence from several small-sample studies. But there’s a potential problem. Not all studies into a phenomenon get published. Moreover, there’s a tendency for those having ‘interesting results’ to be more likely to be published than others. But a valid combination of information should include results from all studies, not just those with results in a particular direction.

Let's consider a simple made-up example. Suppose I'm concerned that coins are being produced that have a propensity to come up heads when tossed. I set up studies all around the country where people are asked to toss a coin 10 times and report whether they got 8 or more heads in their experiments. In quite a few of the studies the results turn out to be positive – 8 or more heads – and I encourage the researchers in those studies to publish the results. Now, 8 or more heads in any one study is not especially unusual: 10 is a very small sample size. So nobody gets very excited about any one of these results. But then, perhaps because they are researching for a book, someone notices that there are many independent studies all suggesting the same thing. They know that individually the results don't say much, but in aggregate the results seem overwhelming, and they conclude that there is very strong evidence that coins are being produced with a tendency to come up heads. But that conclusion is false, because the far greater number of studies in which 8 or more heads weren't obtained never got published.
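Here, just for illustration, is a small Python simulation in the same spirit: every coin is fair, only the studies that happen to see 8 or more heads get 'published', and the published studies taken together look like strong evidence of biased coins.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STUDIES = 1000   # studies run around the country
N_TOSSES = 10      # tosses per study - a very small sample size
FAIR_P = 0.5       # every coin really is fair

heads = rng.binomial(N_TOSSES, FAIR_P, size=N_STUDIES)   # heads count in each study

published = heads[heads >= 8]   # only the 'interesting' results get published

print(f"Studies published: {len(published)} out of {N_STUDIES}")
print(f"Proportion of heads across ALL studies:       {heads.mean() / N_TOSSES:.2f}")      # about 0.50
print(f"Proportion of heads across published studies: {published.mean() / N_TOSSES:.2f}")  # about 0.82
```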

And that's exactly what happened to Kahneman. The uninteresting results don't get published, while the interesting ones do, even if they are not statistically reliable due to small sample sizes. Then someone combines the published results via meta-analysis and gets a totally biased picture.

That’s how easy it is to be wrong.

It’s not based on facts

We think that this is the most extreme version and it’s not based on facts. It’s not data-driven. We’d like to see something that is more data-driven.

Wow! Who is this staunch defender of statistical methodology? This guardian of scientific method. This warrior of the value of empirical information to help identify and confirm a truth.

Ah, but wait a minute, here’s the rest of the quote…

It’s based on modelling, which is extremely hard to do when you’re talking about the climate. Again, our focus is on making sure we have the safest, cleanest air and water.

Any ideas now?

Since it requires an expert in doublespeak to connect those two quotes together, you might be thinking Donald Trump, but we’ll get to him in a minute. No, this was White House spokesperson Sarah Sanders in response to the US government’s own assessment of climate change impact. Here’s just one of the headlines in that report (under the Infrastructure heading):

Our Nation’s aging and deteriorating infrastructure is further stressed by increases in heavy precipitation events, coastal flooding, heat, wildfires, and other extreme events, as well as changes to average precipitation and temperature. Without adaptation, climate change will continue to degrade infrastructure performance over the rest of the century, with the potential for cascading impacts that threaten our economy, national security, essential services, and health and well-being.

I'm sure I don't need to convince you of the overwhelming statistical and scientific evidence of climate change. But for argument's sake, let me place here again a graph that I included in a previous post.

This is about as data-driven as you can get. Data have been carefully sourced and appropriately combined from locations all across the globe. Confidence intervals have been added – these are the vertical black bars – which account for the fact that we’re estimating a global average on the basis of a limited number of data. But you’ll notice that the confidence bars are smaller for more recent years, since more data of greater reliability is available. So it’s not just data, it’s also careful analysis of data that takes into account that we are estimating something. And it plainly shows that, even after allowance for errors due to data limitation, and also allowance for year-to-year random variation, there has been an upward trend for at least the last 100 years,  which is even more pronounced in the last 40 years.

Now, by the way, here’s a summary of the mean annual total of CO2 that’s been released into the atmosphere over roughly the same time period.

Notice any similarities between these two graphs?

Now, as you might remember from my post on Simpson's Paradox, correlations are not necessarily evidence of causation. It could be, just on the strength of these two graphs, that both CO2 emission and global mean temperature are being affected by some other process, which is causing them both to change in a similar way. But, here's the thing: there is a proven scientific mechanism by which an increase in CO2 can cause an increase in atmospheric temperature. It's basically the greenhouse effect: CO2 molecules cause heat to be retained in the atmosphere, rather than radiated back into space, as would be the case if those molecules weren't there. So:

  1. The graphs show a clear correlation between CO2 levels and mean temperature levels;
  2. CO2 levels in the atmosphere are rising and bound to rise further under current energy policies worldwide;
  3. There is a scientific mechanism by which increased CO2 emissions lead to an increase in mean global temperature.

Put those three things together and you have an incontrovertible case that climate change is happening, that it's at least partly driven by human activity and that the key to limiting the damaging effects of such change is to introduce energy policies that drastically reduce CO2 emissions.

All pretty straightforward, right?

Well, this is the response to his own government’s report by the President of the United States:

In summary:

I don’t believe it

And the evidence for that disbelief:

One of the problems that a lot of people like myself — we have very high levels of intelligence, but we’re not necessarily such believers.

If only the President of the United States was just a little less intelligent. And if only his White House spokesperson wasn’t such an out-and-out liar.

 

Regression to the mean

kahneman

When Matthew.Benham@smartbapps.co.uk was asked at a recent offsite for book recommendations, his first suggestion was  Thinking, Fast and Slow.  This is a great book, full of insights that link together various worlds, including statistics, economics and psychology. Daniel Kahneman, the book’s author, is a world-renowned psychologist in his own right, and his book makes it clear that he also knows a lot about statistics. However, in a Guardian article  a while back, Kahneman was asked the following:

Interestingly, it’s a fact that highly intelligent women tend to marry men less intelligent than they are. Why do you think this might be?

He answered as follows:

It’s a fact – but it’s not interesting at all. Assuming intelligence is similarly distributed between men and women, it’s just a mathematical inevitability that highly intelligent women, on average, will be married to men less intelligent than them. This is “regression to the mean”, and all it really tells you is that there’s no perfect correlation between spouses’ intelligence levels. But our minds are predisposed to try to construct a more compelling explanation.

<WOMEN: please insert your own joke here about the most intelligent women choosing not to marry men at all.>

Anyway, I can’t tell if this was Kahneman thinking fast or slow here, but I find it a puzzling explanation of regression to the mean, which is an important phenomenon in sports modelling. So, what is regression to the mean, why does it occur and why is it relevant to Smartodds?

Let’s consider these questions by looking at a specific dataset. The following figure shows  the points scored in the first and second half of each season by every team in the Premier League since the inaugural 1992-93 season. Each point in the plot represents a particular team in a particular season of the Premier League. The horizontal axis records the points scored by that team in the first half of the season; the vertical axis shows the number of points scored by the same team in the second half of the same season.

 

pp1

Just to check your interpretation of the plot, can you identify:

  1. The point which corresponds to Sunderland’s 2002-03 season where they accumulated just a single point in the second half of the season?
  2. The point which corresponds to Man City’s 100-point season in 2017-18?

Click here to see the answers.

Now, let’s take that same plot but add a couple of lines as follows:

  • The red line divides the data into roughly equal sets. To its left are the points that correspond to the 50% poorest first-half-of-season performances; to its right are the 50% best first-half-of-season performances.
  • The green line corresponds to teams who had an identical performance in the first and second half of a season. Teams below the green line performed better in the first half of a season than in the second; teams above the green line performed better in the second half of a season than in the first.

pp2

 

In this way the picture is divided into 4 regions that I’ve labelled A, B, C and D. The performances within a season of the teams falling in these regions are summarised in the following table:

Region   First half        Best half   Number of points
A        Below average     First        94
B        Above average     First       174
C        Above average     Second       71
D        Below average     Second      187

I’ve also included in the table the number of points in each of the regions. (Counting directly from the figure will give slightly different numbers because of overlapping points).

First compare A and D, the teams that performed below average in the first half of a season. Looking at the number of points, such teams are much more likely to have had a better second half to the season (187 to 94). By contrast, comparing B and C, the teams that do relatively well in the first half of the season are much more likely to do worse in the second half of the season (71 to 174).

This is regression to the mean. In the second half of a season teams “regress” towards the average performance: teams that have done below average in the first half of the season generally do a bit less badly in the second half; teams that have done well in the first half generally do a bit less well in the second half. In both cases there is a tendency to  move – regress – towards the average in the second half. I haven’t done anything to force this; it’s just what happens.

We can also view the phenomenon in a slightly different way. Here’s the same picture as above, where points falling on the green line would correspond to a team doing equally well in both halves of the season. But now I’ve also used standard statistical methods to add a “line of best fit” to the data, which is shown in orange. This line is a predictor of how teams will perform in the second half of season given how they performed in the first, based on all of the data shown in the plot.

pp3

In the left side of the plot are teams who have done poorly in the first half of the season. In this region the orange line is above the green line, implying that such teams are predicted to do better in the second half of the season. On the right side of the plot are the teams who have done well in the first half of the season. But here the orange line is below the green line, so these teams are predicted to do worse in the second half of the season. This, again, is the essence of regression to the mean.

One important thing though: teams that did well in the first half of the season still tend to do well in the second half of the season; the fact that the orange line slopes upwards confirms this. It’s just that they usually do less well than they did in the first half; the fact that the orange line is less steep than the green line is confirmation of that. Incidentally, you’ve probably heard the term “regression line” used to describe a “line of best fit”, like the orange line. The origins of this term are precisely because the fit often involves a regression to the mean, as we’ve seen here.

But why should regression to the mean be such an intrinsic phenomenon that it occurs in football, psychology and a million other places? I just picked the above data at random: I’m pretty sure I could have picked data from any competition in any country – and indeed any sport – and I’d have observed the same effect. Why should that be?

Let’s focus on the football example above. The number of points scored by a team over half a season (so they’ve played all other teams) is dependent on two factors:

  1. The underlying strength of the team compared to their opponents; and
  2. Luck.

Call these S (for strength) and L (for luck) and notionally let’s suppose they add together to give the total points (P). So

P = S + L

Although there will be some changes in S over a season, as teams improve or get worse, it’s likely to be fairly constant. But luck is luck. And if a team has been very lucky in the first half of the season, it’s unlikely they’ll be just as lucky in the second. And vice versa. For example, if you roll a dice and get a 6, you’re likely to do less well with a second roll. While if you roll a 1, you’ll probably do better on your next roll. So while S is pretty static, if L was unusually big or small in the first half of the season, it’s likely to be closer to the average in the second half. And the overall effect on P? Regression to the mean, as seen in the table and figures above.
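If you'd like to see this mechanism in action, here's a small simulated sketch in Python of the P = S + L idea. The numerical scales are invented purely for illustration; the point is that with fixed strengths and fresh luck in each half, the regression-to-the-mean pattern appears automatically.

```python
import numpy as np

rng = np.random.default_rng(42)

N_SEASONS = 1000   # simulated seasons
N_TEAMS = 20       # teams per season

# P = S + L: fixed strength plus independent half-season luck (invented scales)
S = rng.normal(35, 8, size=(N_SEASONS, N_TEAMS))    # underlying strength, same in both halves
L1 = rng.normal(0, 6, size=(N_SEASONS, N_TEAMS))    # luck in the first half of the season
L2 = rng.normal(0, 6, size=(N_SEASONS, N_TEAMS))    # independent luck in the second half

first_half = S + L1
second_half = S + L2

# Average change for teams that were above/below the mean in the first half
above = first_half > first_half.mean()
change = second_half - first_half
print(f"Teams above average in first half changed by {change[above].mean():+.2f} points on average")
print(f"Teams below average in first half changed by {change[~above].mean():+.2f} points on average")

# The 'line of best fit' has slope less than 1 - the hallmark of regression to the mean
slope, intercept = np.polyfit(first_half.ravel(), second_half.ravel(), 1)
print(f"Slope of the best-fit (regression) line: {slope:.2f}")
```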

Finally: what’s the relevance of regression to the mean to sports modelling? Well, it means that we can’t simply rely on historic performance as a predictor for future performance.  We need to balance historic performance with average performance to compensate for inevitable regression to the mean effects; all of our models are designed with exactly this feature in mind.

 

Enjoy the universe while you can

universe

I’ve mentioned in the past that one of the great things about Statistics is the way it’s a very connected subject. A technique learnt for one type of application often turns out to be relevant for something completely different. But sometimes the connections are just for fun.

Here's a case in point. A while back I wrote a post making fun of Professor Brian Cox, the world-renowned astrophysicist, who seemed to be struggling to get to grips with the intricacies of the Duckworth-Lewis method for adjusting run targets in cricket matches that have been shortened due to poor weather conditions. You probably know, but I forgot to mention, that in his younger days Brian was the keyboard player for D:Ream. You'll have heard their music even if you don't recognise the name. Try this for example:

 

Anyway, shortly after preparing that earlier post, I received the following in my twitter feed:

I very much doubt it’s true, but I love the idea that the original version of

Things can only get better

was going to be

Things inexorably get worse, there’s a statistical certainty that the universe will fall to bits and die

Might not have had the same musical finesse, but is perhaps a better statement on the times we live in. Or as Professor Cox put it in his reply:

 

xG, part 1

Adam.Weinrich@smartodds.co.uk wrote and asked for a discussion of xG. I’m so happy about this suggestion that I’m actually going to do two posts on the topic. In this post we’ll look at the xG for a single shot on goal; in a subsequent post we’ll discuss the xG for a passage of play and for an entire game.

xG stands for expected goals, and it’s famous enough now that it’s used almost routinely on Match of the Day. But what is it, why is it used, how is it calculated and is it all it’s cracked up to be?

It's well understood these days that, because goals themselves are so rare, it's better when assessing how well a team has performed in a game to go beyond the final result and look at the match in greater statistical detail.

For example, this screenshot shows the main statistics for the recent game between Milan and Genoa, as provided by Flashscore. Milan won 2-1, but it’s clear from the data here that they also dominated the game in terms of possession and goal attempts. So, on the basis of this information alone, the result seems fair.

Actually, Milan’s winner came in injury time, and if they hadn’t got that goal, again on the basis of the above statistics, you’d probably argue that they would have been unlucky not to have won. In that case the data given here in terms of shots and possession would have given a fairer impression of the way the match played out than just the final result.

But even these statistics can be misleading: maybe most of Milan’s goal attempts were difficult, and unlikely to lead to goals, whereas Genoa’s fewer attempts were absolute sitters that they would score 9 times out of 10. If that were the case, you might conclude instead that Genoa were unlucky to lose. xG – or expected goals – is an attempt to take into account not just the number of chances a team creates, but also the difficulty of those chances.

The xG for a single attempt at goal is an estimate of the probability that, given the circumstances of a shot – the position of the ball, whether the shot is kicked or a header, whether the shot follows a dribble or not, and other relevant information – it is converted into a goal.

This short video from OPTA gives a pretty simple summary.

 

 

So how is xG calculated in practice? Let’s take a simple example. Suppose a player is 5 metres away from goal with an open net. Looking back through a database of many games, we might find (say) 1000 events of an almost identical type, and on 850 of those occasions a goal was scored. In that case the xG would be estimated as 850/1000 = 0.85. But breaking things down further, it might be that 900 of the 1000 events were kicked shots, while 100 were headers; and the number of goals scored respectively from these events were 800 and 50. We’d then calculate the xG for this event as 800/900 = 0.89 for a kicked shot, but  50/100 = 0.5 for a header.

But there are some complications. First, there are unlikely to be many events in the database corresponding to exactly the same situation (5 metres away with an open goal). Second, we might want to take other factors into account: scoring rates from the same position are likely to be different in different competitions, for example, or the scoring rate might depend on whether the shot follows a dribble by the same player. This means that simple calculations of the type described above aren’t feasible. Instead, a variation of standard regression – logistic regression – is used.  This sounds fancy, but it’s really just a statistical algorithm for finding the best formula to convert the available variables (ball position, shot type etc.) into the probability of a goal.
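To make that a little more concrete, here's a hedged sketch of what fitting such a model might look like in Python with scikit-learn. The data file and the column names (distance, angle, whether the shot was a header or followed a dribble) are hypothetical stand-ins; providers like OPTA use their own, much richer feature sets.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical shot data: one row per shot, 'goal' equal to 1 if the shot was scored
shots = pd.read_csv("shots.csv")   # hypothetical file with columns: distance, angle, header, dribble, goal

features = ["distance", "angle", "header", "dribble"]
model = LogisticRegression()
model.fit(shots[features], shots["goal"])

# xG for a new shot: the fitted model converts the shot's characteristics into a probability
new_shot = pd.DataFrame([{"distance": 11, "angle": 0.6, "header": 0, "dribble": 0}])
xg = model.predict_proba(new_shot)[0, 1]
print(f"Estimated xG for this shot: {xg:.2f}")
```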

So in the end, xG is calculated via a formula that takes a bunch of information at the time of a shot – ball position, type of shot etc. etc. – and converts it into a probability that the shot results in a goal. You can see what xG looks like using this  simple app.

Actually, there are 2 alternative versions of xG here, that you can switch between in the first dialog box. For both versions, the xG will vary according to whether the shot is a kick or a header. For the second version the xG also depends on whether the shot is assisted by a cross, or preceded by a dribble: you select these options with the remaining dialog boxes. In either case, with the options selected, clicking on the graphic of the pitch will return the value of xG according to the chosen model. Naturally, as you get closer to the goal and with a more favourable angle the xG increases.

One point to note about xG is that there is no allowance for actual players or teams. In the OPTA version there is a factor that distinguishes between competitions – presumably since players are generally better at converting chances in some competitions than others – but the calculation of xG is identical for all players and teams in a competition. Loosely speaking, xG is the probability a shot leads to a goal by an average player who finds themselves in that position in that competition. So the actual xG, which is never calculated, might be higher if it’s a top striker from one of the best teams, but lower if it’s a defender who happened to stray into that position. And in exactly the same way, there is no allowance in the calculation of xG for the quality of the opposition: xG averages over all players, both in attack and defence.

It follows from all this discussion that there’s a subtle difference between xG and the simpler  statistics of the kind provided by Flashscore. In the latter case, as with goals scored, the statistics are pure counts of different event types. Apart from definitions of what is a ‘shot on goal’, for example, two different observers would provide exactly the same data. xG is different: two different observers are likely to agree on the status of an event – a shot on an open goal from the corner of the goal area, for example – but they may disagree on the probability of such an event generating a goal. Even the two versions in the simple app above gave different values of xG, and OPTA would give a different value again. So xG is a radically different type of statistic; it relies on a statistical model for converting situational data into probabilities of goals being scored, and different providers may use different models.

We’ll save discussion about the calculation of xG for a whole match or for an individual player in a whole match for a subsequent post. But let me leave you with this article from the BBC. The first part is a summary of what I’ve written here – maybe it’s even a better summary than mine. And the second part touches on issues that I’ll discuss in a subsequent post. But half way down there’s a quiz in which five separate actions are shown and you’re invited to guess the value of xG for each. See if you can beat my score of 2/5.


Incidentally, why do we use the term 'expected goals' if xG is a probability? Well, let's consider the simpler experiment of tossing a coin. Assuming it's a fair coin, the probability of getting a head is 0.5. In (say) 1000 tosses of the coin, on average I'd get 500 heads. That's 0.5 heads per toss, so as well as being the probability of a head, 0.5 is also the number of heads we expect to get (on average) when we toss a single coin. xH, if you like. And the same argument would work for a biased coin that has probability 0.6 of coming up heads: xH = 0.6. And exactly the same argument works for goals: if xG is the probability of a certain type of shot becoming a goal, it's also the number of goals we'd expect, on average, per event of that type.


And finally… if there are any other statistical topics that you’d like me to discuss in this blog, whether related to sports or not, please do write and let me know.