Here’s a fictional conversation from Match of the Day:

Lineker: United have picked up just 3 points from their opening 3 games. How many years is it since United had such a terrible start to a season?

Shearer: Ooh, that’s one for the statisticians.

It’s fictional because I don’t think it actually occurred. But it’s real in the sense that it’s typical of a commonly held view about what statistics is for, and what statisticians do, in a sporting context. I want to debunk this point of view. One of my aims in this blog is to show that statistics has a much more important role in the study of sport than dredging through the history books to identify periods of bad United results.

In this post we’ll look at Simpson’s paradox. It’s a simple and unsettling phenomenon that arises in many different situations, and it illustrates why statistics is more than just summarising data. We’ll look at two real-life examples (both taken from Wikipedia).

The first set of data come from a medical trial into the success rates of procedures for the removal of kidney stones. The study compared two available procedures, labelled A and B respectively, and analysed the results separately for both small and large kidney stones.

The success rates for a sample of patients with small kidney stones are given in the following table.

| Small Stones | Treatment A | Treatment B |
|---|---|---|
| Success Rate | 81/87 = 93% | 234/270 = 87% |

So, for example, 87 patients were given treatment A, and in 81 of these cases the treatment was deemed successful. This corresponds to a success rate of 93%. Similarly, the success rate for the 270 patients given treatment B was 87%.

For patients with a large kidney stone, the success rates using treatments A and B are summarised in the same way in the following table:

| Large Stones | Treatment A | Treatment B |
|---|---|---|
| Success Rate | 192/263 = 73% | 55/80 = 69% |

As is clear from the tables, treatment A has a higher success rate than treatment B for patients with small kidney stones and for patients with large ones. If you were a doctor having to decide which treatment to offer to a patient, all other things being equal you’d surely choose treatment A for both types of patient.

But suppose we group all the patients together, simply adding the data from the previous tables, and then calculate the success rates with either treatment. This results in the following table (check for yourselves):

| All Stones | Treatment A | Treatment B |
|---|---|---|
| Success Rate | 273/350 = 78% | 289/350 = 83% |

Remarkably, from exactly the same data, treatment B now has a higher success rate than treatment A!

This is Simpson’s paradox. Having just the information from the combined table, a doctor would recommend Treatment B. But having the two separate tables for small and large kidney stones, a doctor would recommend Treatment A for **both** types of patient. It seems to defy all reasonable logic.
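If you’d rather let a computer do the checking, here is a minimal Python sketch that recomputes the three tables from the raw counts. The counts are the ones given above; the names `small`, `large`, `rate` and `pooled` are just my own labels for this illustration.

```python
from fractions import Fraction

# (successes, patients) for each treatment, copied from the tables above
small = {"A": (81, 87), "B": (234, 270)}
large = {"A": (192, 263), "B": (55, 80)}


def rate(successes, patients):
    """Success rate as an exact fraction, so comparisons have no rounding error."""
    return Fraction(successes, patients)


def pooled(*groups):
    """Add successes and patients across groups, separately for each treatment."""
    return {
        treatment: (
            sum(g[treatment][0] for g in groups),
            sum(g[treatment][1] for g in groups),
        )
        for treatment in groups[0]
    }


combined = pooled(small, large)

for label, table in [("Small", small), ("Large", large), ("All", combined)]:
    a, b = rate(*table["A"]), rate(*table["B"])
    higher = "A" if a > b else "B"
    print(f"{label:5s}  A: {float(a):.1%}  B: {float(b):.1%}  higher: {higher}")
```

The only operation involved is adding up the numerators and the denominators, which is exactly what “grouping all the patients together” means.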

I’ll leave you to think about (or to Google) this example for a few days. I’ll then post again with some discussion.

But first here’s another example, this time in a sporting context. In baseball the standard measure of a batter’s performance is their batting average: roughly speaking, the proportion of times they make a successful hit from an appearance at the plate. The following tables compare the batting averages of two particular batters in 1995 and 1996 respectively:

| 1995 | Derek Jeter | David Justice |
|---|---|---|
| Batting Average | 12/48 = 25% | 104/411 = 25.3% |

| 1996 | Derek Jeter | David Justice |
|---|---|---|
| Batting Average | 183/582 = 31.4% | 45/140 = 32.1% |

So, for example, in 1995 Derek Jeter made 48 appearances at the plate and made 12 hits, leading to a batting average of 25%. And in the same year David Justice recorded a batting average of 25.3%. Indeed, comparing the averages in both tables, David Justice recorded a higher batting average than Derek Jeter in both 1995 and 1996.

But, if we combine the data from the two tables to get the results for the entire period 1995-96 and re-calculate the averages, we get the following:

| 1995-96 | Derek Jeter | David Justice |
|---|---|---|
| Batting Average | 195/630 = 31% | 149/551 = 27% |

We see Simpson’s paradox again. Derek Jeter has a higher batting average over the entire period even though David Justice had the superior average in each of the 2 seasons. So who was the better batter?
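Again, the arithmetic is easy to verify. The following Python sketch uses the hits and plate appearances from the tables above; the names `seasons`, `average` and `combined_record` are my own, chosen just for this illustration.

```python
# (hits, appearances) per season, copied from the tables above
seasons = {
    "Derek Jeter": {1995: (12, 48), 1996: (183, 582)},
    "David Justice": {1995: (104, 411), 1996: (45, 140)},
}


def average(hits, appearances):
    """Batting average: hits divided by appearances at the plate."""
    return hits / appearances


def combined_record(by_year):
    """Total hits and total appearances across all seasons for one player."""
    hits = sum(h for h, _ in by_year.values())
    appearances = sum(n for _, n in by_year.values())
    return hits, appearances


for player, by_year in seasons.items():
    per_year = ", ".join(
        f"{year}: {average(*record):.3f}" for year, record in sorted(by_year.items())
    )
    overall = average(*combined_record(by_year))
    print(f"{player:14s} {per_year}, 1995-96: {overall:.3f}")
```

Note that the combined average is computed from the total hits and total appearances. It is not the average of the two seasonal averages.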

Like I say, I’ll leave this here for a while and discuss again later. Feel free to add something in the comments section if you’d like to discuss or ask questions.

One final thing: although I’ll save discussion of this paradox till another post, I will say that it doesn’t arise just out of chance. It’s not a quirk of having too few data, and it wouldn’t go away if we had bigger sample sizes. It’s a genuine, and rather disturbing, phenomenon, and it can only be resolved by a deeper understanding of statistics than the arithmetic analysis provided above.
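To see that sample size isn’t the issue, here is a small Python sketch that multiplies every count in the kidney stone tables by the same factor `k`. The scaled-up counts are purely hypothetical (they are not real trial data), and `shows_reversal` is just a name I’ve made up for this check.

```python
from fractions import Fraction

# (successes, patients) from the kidney stone tables in this post
small = {"A": (81, 87), "B": (234, 270)}
large = {"A": (192, 263), "B": (55, 80)}


def shows_reversal(small, large, k=1):
    """Return True if A beats B in both subgroups but B beats A once pooled.

    k multiplies every count, mimicking a hypothetical trial that is
    k times larger with exactly the same proportions.
    """
    def rate(successes, patients):
        return Fraction(successes * k, patients * k)

    a_wins_small = rate(*small["A"]) > rate(*small["B"])
    a_wins_large = rate(*large["A"]) > rate(*large["B"])
    pooled_a = rate(small["A"][0] + large["A"][0], small["A"][1] + large["A"][1])
    pooled_b = rate(small["B"][0] + large["B"][0], small["B"][1] + large["B"][1])
    return a_wins_small and a_wins_large and pooled_b > pooled_a


for k in (1, 10, 1000):
    print(k, shows_reversal(small, large, k))
```

This prints `True` for every `k`: scaling all the counts leaves every proportion unchanged, so the reversal is still there however large the hypothetical study gets.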

Footnote. Here you go Alan: at the time of writing (after 3 games) this is Man United’s worst start to a season since the 92/93 season when they lost their opening 2 games to Sheffield United and Everton, and drew their third against Ipswich. Terrible. But they did go on to win the league that season by 10 points!
