Simpson’s paradox de-paradoxed


In an earlier post I gave two examples of Simpson’s paradox. The first example concerned the success rate of two different procedures for the removal of kidney stones; the second concerned the batting averages of two baseball players. In both cases there seemed to be a contradiction in the data depending on how the data were analysed. I’d like now to try to explain this phenomenon.

Though the context is slightly different and I’ve actually just invented the data, you can perhaps see what might be happening from the following picture. These are (fictional) measures of ratings of a number of sportspeople against their age. What conclusions would you draw?


It’s a noisy picture, but the general pattern seems to be that rating improves with age. Indeed, if I use standard statistical procedures to estimate the general trend in this picture I get the green line as shown below:


This confirms a general tendency for rating to improve with age (notwithstanding some variation around the general trend). But suppose I now tell you that actually these data correspond to several observations through time on just two players. In the plot below, I’ve coloured the observations separately for the two players. What do you conclude now?


You’d probably conclude that the blue player is more highly rated than the red player, but for both players ratings reduce with age. This is again confirmed with formal estimates of trend lines for each of the players:


So, when looking at the two players separately, age causes ratings to go down for each player. But aggregating the player data, as we did in the first plot, leads to the misleading conclusion that age tends to result in increased ratings. It does for the aggregated data if we ignore the player information, but the more likely explanation for that is that the older of the two players in our data just happens to be a much better player, and that the real effect of age is a reduction in ratings for both players.

This is Simpson’s paradox: by ignoring a hidden variable (in this case the player identifier) we get a misleading picture of the relationship between the original variables (rating and age). Sure, ratings increase with age, but only because the older player had much higher ratings overall. Looking separately at each player, ratings go down with age.
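If you’d like to reproduce the effect yourself, here is a minimal sketch. The numbers are newly made up for the purpose (they are not the data behind the plots above), and the `slope` helper is just a hand-rolled least-squares fit rather than whatever package produced the trend lines.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Invented data: each player's rating falls by one point per year,
# but the older (blue) player is rated about 20 points higher overall.
red_age = [20, 21, 22, 23, 24, 25]
red_rating = [60, 59, 58, 57, 56, 55]
blue_age = [30, 31, 32, 33, 34, 35]
blue_rating = [80, 79, 78, 77, 76, 75]

red_slope = slope(red_age, red_rating)      # negative
blue_slope = slope(blue_age, blue_rating)   # negative
# Ignore the player identifier and fit one line to everything.
pooled_slope = slope(red_age + blue_age, red_rating + blue_rating)  # positive

print(red_slope, blue_slope, pooled_slope)
```

Each player’s slope comes out at -1, while the pooled slope is about +1.69: the gap between the players swamps the decline within each player.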

A version of this same phenomenon occurs in each of the examples from the previous post.

  1. If you look back at the kidney stones data, doctors tended to give treatment A to patients with the more severe disease (larger kidney stones). This reduces the success rate for treatment A; not because it’s a less effective treatment, but because it’s being used on patients whose condition is more severe. Indeed, it looked like treatment B was best from the aggregate data. But the true story emerges from the original tables: treatment A is best for all patients.
  2. Simpson’s paradox arises in the baseball example because of the large differences in the number of appearances at the plate per year for the two batters. Derek Jeter has far more appearances in 1996 than 1995; for David Justice it is the reverse. This means that the aggregate batting average for Jeter is close to his ’96 value, while for Justice it is close to his ’95 value. Moreover both players had better averages in ’96 compared to ’95. Consequently, the overall averages favour Jeter, who had most of his appearances in ’96, the year in which the averages were higher. Yet even in ’96 Justice’s average was higher; it’s just that it was based on relatively few appearances. Clearly, Derek Jeter was the better batter over the entire period, despite the quirk of having a lower average than David Justice in both years.
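The role of the appearance counts can be made explicit by writing each overall average as a weighted average of the two yearly averages, with weights equal to the share of appearances in each year. Using the figures from the earlier post (decimals rounded):

```latex
\begin{align*}
\text{Jeter:}\quad \frac{195}{630}
  &= \frac{48}{630}\cdot\frac{12}{48} + \frac{582}{630}\cdot\frac{183}{582}
  \approx 0.076 \times 0.250 + 0.924 \times 0.314 \approx 0.31,\\[4pt]
\text{Justice:}\quad \frac{149}{551}
  &= \frac{411}{551}\cdot\frac{104}{411} + \frac{140}{551}\cdot\frac{45}{140}
  \approx 0.746 \times 0.253 + 0.254 \times 0.321 \approx 0.27.
\end{align*}
```

Jeter’s overall figure puts over 90% of its weight on the high-scoring ’96 season, whereas Justice’s puts roughly three quarters of its weight on the low-scoring ’95 season.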

What’s especially interesting from these two examples is that the ‘correct’ resolution of the paradox is completely different in the two cases. For the medical example, taking the experimental situation into account, the non-aggregate interpretation is best: treatment A was best for both types of kidney stone and should be preferred, even though treatment B had the higher overall success rate. But with the baseball data, Derek Jeter was the superior batter since he had the higher overall average, even though his average was beaten by that of David Justice in both years.
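One way to see why the non-aggregated reading wins in the medical example is to ask what each treatment’s overall success rate would look like if both had been given to the same mix of patients. The sketch below does this with the kidney stone counts from the earlier post. The reweighting (usually called direct standardisation) is an added illustration rather than part of the original analysis, and using the pooled patient mix as the common reference is an assumption.

```python
# (successes, patients) by stone size, from the kidney stone tables
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

# Assumed reference mix: share of small and large stones among all 700 patients
total = {size: sum(counts[t][size][1] for t in counts) for size in ("small", "large")}
n_all = sum(total.values())
mix = {size: total[size] / n_all for size in total}

def standardised_rate(treatment):
    """Success rate the treatment would show if applied to the reference mix."""
    return sum(mix[size] * s / n for size, (s, n) in counts[treatment].items())

adj_a = standardised_rate("A")
adj_b = standardised_rate("B")
print(f"A: {adj_a:.1%}, B: {adj_b:.1%}")
```

With the same mix of small and large stones for both treatments, A comes out at roughly 83% and B at roughly 78%, so the apparent advantage of B disappears once the difference in case severity is removed.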

The moral is that Statistics is bound to be a more intricate process than that of simple number crunching. Here we had two different situations which led to the same phenomenon of Simpson’s paradox. But in one case an understanding of the experimental setting supports a non-aggregated solution; in the other the aggregate solution is best. Context is everything: treat data as if they are numbers without context and there’s a very good chance you’ll draw entirely the wrong conclusions.

Harry pointed me to a gif that illustrates Simpson’s paradox in much the same way as my non-animated graphs above. I’m not sure this is exactly the gif Harry suggested, but the gist is much the same. So if you prefer your Simpson’s paradox explanations all-dancing and in Technicolor, here you go:



You probably get the idea by now. Looking at just the raw data (the black dots before the animation starts) there is a strong downward trend (shown by the red line once you start the video). But if you let the video roll you’ll see that different groups of the data belong to different individuals, as indicated by the different colours. The trend line for every one of those individuals is positive, even though the overall trend was distinctly negative.


Simpson’s paradox


Here’s a fictional conversation from Match of the Day:

Lineker: United have picked up just 3 points from their opening 3 games. How many years is it since United had such a terrible start to a season?

Shearer: Ooh, that’s one for the statisticians.

It’s fictional because I don’t think it actually occurred. But it’s real in the sense that it’s a typical conversation reflecting a commonly-held view about the importance of statistics and the role of statisticians in a sporting context. I want to debunk this point of view, and one of my aims in this blog is to show how statistics has a much more important role in the study of sports, above and beyond dredging through the history books to identify periods of bad United results.

In this post we’ll look at Simpson’s paradox. It’s a simple and unsettling phenomenon that arises in many different situations, and provides an illustration of why Statistics is more than just summarising data. We’ll look at two real-life examples (both taken from Wikipedia).

The first set of data come from a medical trial into the success rates of procedures for the removal of kidney stones. The study compared two available procedures, labelled A and B respectively, and analysed the results separately for both small and large kidney stones.

The success rates for a sample of patients with small kidney stones are given in the following table.

| Small Stones | Treatment A | Treatment B |
|---|---|---|
| Success Rate | 81/87 = 93% | 234/270 = 87% |

So, for example, 87 patients were given treatment A, and in 81 of these cases the treatment was deemed successful. This corresponds to a success rate of 93%. Similarly, the success rate for the 270 patients given treatment B was 87%.

For patients with a large kidney stone, the success rates using treatments A and B are summarised in the same way in the following table:

| Large Stones | Treatment A | Treatment B |
|---|---|---|
| Success Rate | 192/263 = 73% | 55/80 = 69% |

As is clear from the tables, for patients with either small or large kidney stones, treatment A has a higher success rate than treatment B, and if you were a doctor having to decide which treatment to offer to a patient, all other things being equal you’d surely choose treatment A for both types of patient.

But suppose we group all the patients together, simply adding the data from the previous tables, and then calculate the success rates with either treatment. This results in the following table (check for yourselves):

| All Stones | Treatment A | Treatment B |
|---|---|---|
| Success Rate | 273/350 = 78% | 289/350 = 83% |

Remarkably, from exactly the same data, treatment B now has a higher success rate than treatment A!
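If you’d rather let a computer do the checking, the following short sketch adds up the counts from the two tables and prints the three sets of success rates. The variable and function names are mine, chosen for illustration; the numbers are exactly those in the tables above.

```python
from fractions import Fraction

# (successes, patients) for each treatment, copied from the tables above
small = {"A": (81, 87), "B": (234, 270)}
large = {"A": (192, 263), "B": (55, 80)}

def rate(successes, patients):
    """Exact success rate as a fraction."""
    return Fraction(successes, patients)

# Combine the tables by adding successes and patients separately
combined = {
    t: (small[t][0] + large[t][0], small[t][1] + large[t][1])
    for t in ("A", "B")
}

for name, table in (("small", small), ("large", large), ("all", combined)):
    a, b = rate(*table["A"]), rate(*table["B"])
    print(f"{name:>5}: A = {float(a):.0%}, B = {float(b):.0%}")
```

Treatment A comes out ahead in the first two lines of output and behind in the third, matching the tables.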

This is Simpson’s paradox. Having just the information from the combined table, a doctor would recommend treatment B. But having the two separate tables for small and large kidney stones, a doctor would recommend treatment A for both types of patient. It seems to defy all reasonable logic.

I’ll leave you to think about (or to Google) this example for a few days. I’ll then post again with some discussion.

But first here’s another example, this time in a sporting context. In baseball the standard measure of a batter’s performance is their batting average: roughly speaking, the proportion of times they make a successful hit from an appearance at the plate. The following tables compare the batting averages of two particular batters in 1995 and 1996 respectively:

| 1995 | Derek Jeter | David Justice |
|---|---|---|
| Batting Average | 12/48 = 25% | 104/411 = 25.3% |

| 1996 | Derek Jeter | David Justice |
|---|---|---|
| Batting Average | 183/582 = 31.4% | 45/140 = 32.1% |

So, for example, in 1995 Derek Jeter made 48 appearances at the plate and made 12 hits, leading to a batting average of 25%. And in the same year David Justice recorded a batting average of 25.3%. Indeed, comparing the averages in both tables, David Justice recorded a higher batting average than Derek Jeter in both 1995 and 1996.

But, if we combine the data from the two tables to get the results for the entire period 1995-96 and re-calculate the averages, we get the following:

| 1995-96 | Derek Jeter | David Justice |
|---|---|---|
| Batting Average | 195/630 = 31% | 149/551 = 27% |

We see Simpson’s paradox again. Derek Jeter has a higher batting average over the entire period even though David Justice had the superior average in each of the 2 seasons. So who was the better batter?
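The same check can be written as a small general-purpose test for this kind of reversal. This is only a sketch: the function names are invented here, and the definition of a “reversal” (one batter ahead in every season, the other ahead on the totals) is my reading of the tables above.

```python
def average(hits, at_bats):
    """Batting average as a plain proportion."""
    return hits / at_bats

def pooled(seasons):
    """Average over all seasons combined: total hits over total appearances."""
    hits = sum(h for h, _ in seasons.values())
    at_bats = sum(n for _, n in seasons.values())
    return average(hits, at_bats)

def is_reversal(overall_winner, yearly_winner):
    """True if one batter wins every season while the other wins on the totals."""
    wins_each_year = all(
        average(*yearly_winner[year]) > average(*overall_winner[year])
        for year in overall_winner
    )
    return wins_each_year and pooled(overall_winner) > pooled(yearly_winner)

# (hits, appearances) per season, from the tables above
jeter = {1995: (12, 48), 1996: (183, 582)}
justice = {1995: (104, 411), 1996: (45, 140)}

print(is_reversal(jeter, justice))  # True
```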

Like I say, I’ll leave this here for a while and discuss again later. Feel free to add something in the comments section if you’d like to discuss or ask questions.

One final thing: although I’ll save discussion of this paradox till another post, I will say that it doesn’t arise just out of chance. It’s not a quirk of having too few data that would go away if we had bigger sample sizes. It’s a genuine – and rather disturbing – phenomenon, and can only be resolved by a deeper understanding of statistics than the arithmetic analysis provided above.
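A quick way to convince yourself of the sample-size point: take the kidney stone counts and multiply every one of them by the same factor, as if the study had been that many times larger with identical proportions. The scaling factor and the function below are hypothetical devices for illustration, not part of the original study.

```python
from fractions import Fraction

# (successes, patients) from the kidney stone tables above
BASE = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def reversal_persists(scale):
    """Check for the paradox after multiplying every count by `scale`."""
    a_wins_each = True
    totals = {"A": [0, 0], "B": [0, 0]}
    for table in BASE.values():
        rates = {}
        for t, (s, n) in table.items():
            s, n = s * scale, n * scale
            rates[t] = Fraction(s, n)
            totals[t][0] += s
            totals[t][1] += n
        a_wins_each = a_wins_each and rates["A"] > rates["B"]
    b_wins_overall = Fraction(*totals["B"]) > Fraction(*totals["A"])
    return a_wins_each and b_wins_overall

print([reversal_persists(k) for k in (1, 10, 1000)])
```

The reversal is there at every scale, because it depends on how the patients are split between the groups and not on how many patients there are.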

Footnote. Here you go Alan: at the time of writing (after 3 games) this is Man United’s worst start to a season since the 92/93 season when they lost their opening 2 games to Sheffield United and Everton, and drew their third against Ipswich. Terrible. But they did go on to win the league that season by 10 points!