Time to kill?


Smartodds loves Statistics would like to remind you that the clocks go back an hour this weekend.

You probably heard that the EU is planning to end the practice of switching between ‘summer’ and ‘winter’ times, in which clocks are artificially moved back and forward by an hour at the end of October and March respectively. The rationale for this procedure of so-called daylight saving is closely linked to historical social, agricultural and industrial demands on energy supplies, but what was relevant a century ago, when the practice was first devised, is rather less relevant today.

Some media stories also suggest that putting an end to daylight saving is rather more urgent. For example: “Daylight Savings Time Literally Kills People”. Or even more dramatically: “Why Daylight Saving Time will Kill us All”.

In part there is some basis to these stories. Messing slightly with people’s regular sleep patterns can induce extra tiredness, and there is some evidence that over an entire population this can lead to an increase in the number of driving-related and other accidental deaths. The effect is very slight though, and really says more about the effect of sleep deprivation on accidental deaths than it does about daylight saving per se.

Rather more surprising and intriguing though is an apparent increase in the rate of heart attacks on the day after clocks go forward an hour in March, with a similar decrease on the day after they go back in October. A recent study published in the British Medical Journal found a 24% increase in patients presenting with acute heart attacks on the day after clocks go forward, and a 21% decrease on the day after clocks go back. The study covered many patients over several years, so the differences are too big to have occurred just by chance. So what’s going on? Does daylight saving give people heart attacks?

Well, the first thing a statistician will do is look for other factors which might explain the results. For example:

  1. Since clocks always change early on a Sunday morning, are Sundays, or maybe Mondays, generally different from other days of the week in terms of heart attack rates, regardless of the clock change effect?
  2. Are there more heart attacks generally at some times of the year compared to others?

The answer to both these questions is yes, but the analysis reported in the BMJ accounted for both of these effects, and others, so the unusual increases and decreases following the daylight saving changes remain even after such allowances have been made. So again, what’s going on? Does moving the clocks induce heart attacks?

Well, not really. When the researchers of the BMJ study counted the number of patients attending hospital with heart attacks within the entire week following a change in daylight saving, rather than just the next day, they found no difference at all following the time change in March or October. Perhaps for physiological or social reasons, heart attacks appear to be slightly delayed – on average – after the change in October, and sped up after the change in March. So if you look only at the days immediately following the change, it does look like the change itself is altering the rate of heart attacks. But over a slightly longer window of a week or so, there’s no evidence of any change at all.
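
This displacement effect is easy to mimic with a toy simulation (entirely invented numbers, not the BMJ data): keep the weekly total of heart attacks fixed, but nudge a fraction of the second day’s events onto the first day, as might happen after the March change.

```python
# Toy illustration of the displacement effect (invented numbers, not
# the BMJ data): the clock change doesn't add heart attacks, it just
# shifts the timing of some of them within the week.
BASE_PER_DAY = 100       # assumed average daily count of heart attacks
SHIFT_FRACTION = 0.24    # assumed fraction of day-2 events pulled forward

week = [BASE_PER_DAY] * 7          # the 7 days after the March change

moved = round(BASE_PER_DAY * SHIFT_FRACTION)
week[0] += moved                   # day 1 looks like a 24% spike...
week[1] -= moved                   # ...at the expense of day 2

print("Day-1 count:", week[0])     # 124
print("Weekly total:", sum(week))  # 700 - unchanged
```

Looking only at day 1 suggests a dramatic effect; summing over the whole week shows none, which is exactly the pattern the researchers found.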

In summary, moving the clocks forward or backwards won’t induce anyone to have a heart attack who wasn’t going to have one anyway; the change might just cause someone’s heart attack to occur slightly earlier or later in the same week.

There seem to be two useful messages from this:

  1. As with Simpson’s paradox, we see the danger of simply carrying out a statistical analysis without taking the context into account. Testing the daily data for whether there is a change in heart attack rates when the clocks change suggests there is an effect. But understanding the context of the problem and looking at the data over a slightly longer timespan indicates that there is no real change.
  2. The media are often just interested in a good story, and won’t let concerns about the quality of a statistical analysis get in the way of that.

I stole most of this material from Matt Parker, who describes himself as a standup mathematician. (I know!) Anyway, if you’re interested, here’s his take on the issue:
[Embedded video: Matt Parker on daylight saving]

Simpson’s paradox de-paradoxed


In an earlier post I gave two examples of Simpson’s paradox. The first example concerned the success rate of two different procedures for the removal of kidney stones; the second concerned the batting averages of two baseball players. In both cases there seemed to be a contradiction in the data depending on how the data were analysed. I’d like now to try to explain this phenomenon.

Though the context is slightly different and I’ve actually just invented the data, you can perhaps see what might be happening from the following picture. These are (fictional) ratings of a number of sportspeople plotted against their age. What conclusions would you draw?

[Figure: scatter plot of the sportspeople’s ratings against age]
It’s a noisy picture, but the general pattern seems to be that rating improves with age. Indeed, if I use standard statistical procedures to estimate the general trend in this picture I get the green line as shown below:

[Figure: the same scatter plot with the estimated trend line (green), sloping upwards]
This confirms a general tendency for rating to improve with age (notwithstanding some variation around the general trend). But suppose I now tell you that actually these data correspond to several observations through time on just two players. In the plot below, I’ve coloured the observations separately for the two players. What do you conclude now?

[Figure: the scatter plot with points coloured by player]
You’d probably conclude that the blue player is more highly rated than the red player, but that for both players ratings decrease with age. This is again confirmed by formal estimates of trend lines for each of the players:

[Figure: the coloured scatter plot with a separate trend line for each player, both sloping downwards]
So, looking at the two players separately, ratings go down with age for each player. But aggregating the player data, as we did in the first plot, leads to the misleading conclusion that ratings tend to increase with age. That is true of the aggregated data if we ignore the player information, but the more likely explanation is that the older of the two players just happens to be a much better player, and that the real effect of age is a reduction in ratings for both players.

This is Simpson’s paradox: by ignoring a hidden variable (in this case the player identifier) we get a misleading picture of the relationship between the original variables (rating and age). Sure, ratings increase with age, but only because the older player had much higher ratings overall. Looking separately at each player, ratings go down with age.
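
A minimal sketch of this effect in code, with invented numbers standing in for the plots above: two players whose ratings each decline by about a point per year, where the older player is rated much higher overall. Fitting a least-squares slope to each player separately, and then to the pooled data, reproduces the sign flip.

```python
import random

random.seed(2)

def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Invented data echoing the plots: each player's rating declines with
# age, but the older player is much better overall.
age_young = list(range(20, 26))
age_old = list(range(27, 33))
rating_young = [70 - (a - 20) + random.gauss(0, 0.5) for a in age_young]
rating_old = [85 - (a - 27) + random.gauss(0, 0.5) for a in age_old]

print(slope(age_young, rating_young))  # negative: rating falls with age
print(slope(age_old, rating_old))      # negative again
# Pooled data: positive, because the older player is simply better.
print(slope(age_young + age_old, rating_young + rating_old))
```

The pooled slope is driven by the gap between the two players, not by any genuine effect of age.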

A version of this same phenomenon occurs in each of the examples from the previous post.

  1. If you look back at the kidney stones data, doctors tended to give treatment A to patients with the more severe disease (larger kidney stones). This reduces the success rate for treatment A; not because it’s a less effective treatment, but because it’s being used on patients whose condition is more severe. Indeed, it looked like treatment B was best from the aggregate data. But the true story emerges from the original tables: treatment A is best for all patients.
  2. Simpson’s paradox arises in the baseball example because of the large differences in the number of appearances at the plate per year for the two batters. Derek Jeter had far more appearances in 1996 than in 1995; for David Justice it was the reverse. This means that the aggregate batting average for Jeter is close to his ’96 value, while for Justice it is close to his ’95 value. Moreover, both players had better averages in ’96 than in ’95. Consequently, the overall averages favour Jeter, who had most of his appearances in ’96, the year in which the averages were higher. Yet even in ’96 Justice’s average was higher; it’s just that it was based on relatively few appearances. Clearly, Derek Jeter was the better batter over the entire period, despite the quirk of having a lower average than David Justice in both years.
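
Point 2 can be made concrete with the numbers from the batting tables: each player’s overall average is a weighted mean of his seasonal averages, with weights equal to his share of plate appearances in each year. A short sketch:

```python
# Hits and plate appearances, taken from the batting tables.
jeter = {'1995': (12, 48), '1996': (183, 582)}
justice = {'1995': (104, 411), '1996': (45, 140)}

def breakdown(player):
    total_apps = sum(apps for _, apps in player.values())
    for year, (hits, apps) in sorted(player.items()):
        # Seasonal average, and the weight it gets in the overall figure.
        print(f"  {year}: average {hits / apps:.3f}, weight {apps / total_apps:.2f}")
    overall = sum(hits for hits, _ in player.values()) / total_apps
    print(f"  overall: {overall:.3f}")

print("Jeter")      # nearly all of his weight is on the strong '96 season
breakdown(jeter)
print("Justice")    # most of his weight is on the weaker '95 season
breakdown(justice)
```

Jeter’s overall figure of .310 sits close to his .314 from 1996 (weight 0.92), while Justice’s .270 sits close to his .253 from 1995 (weight 0.75): the weighting, not the seasonal form, drives the reversal.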

What’s especially interesting from these two examples is that the ‘correct’ resolution of the paradox is completely different in the two cases. For the medical example, taking the experimental situation into account, the non-aggregate interpretation is best: treatment A was best for both types of kidney stone and should be preferred, even though treatment B had the highest overall success rate. But with the baseball data, Derek Jeter was the superior batter since he had the highest overall average, even though his average was beaten by that of David Justice in both years.

The moral is that Statistics is bound to be a more intricate process than simple number crunching. Here we had two different situations leading to the same phenomenon of Simpson’s paradox. But in one case an understanding of the experimental setting supports the non-aggregated solution; in the other the aggregate solution is best. Context is everything: treat data as if they are numbers without context and there’s a very good chance you’ll draw entirely the wrong conclusions.

Harry.Hill@smartodds.co.uk pointed me to a gif that illustrates Simpson’s paradox in much the same way as my non-animated graphs above. I’m not sure this is exactly the gif Harry suggested, but the gist is much the same. So if you prefer your Simpson’s paradox explanations all-dancing and in Technicolor, here you go:

[Animated gif: scatter plot whose overall trend is negative, but with a positive trend within each coloured group]

You probably get the idea by now. Looking at just the raw data (the black dots before the animation starts) there is a strong downward trend (shown by the red line once you start the video). But if you let the video roll you’ll see that different groups of the data belong to different individuals, as indicated by the different colours. The trend line for every one of those individuals is positive, even though the overall trend was distinctly negative.


Simpson’s paradox


Here’s a fictional conversation from Match of the Day:

Lineker: United have picked up just 3 points from their opening 3 games. How many years is it since United had such a terrible start to a season?

Shearer: Ooh, that’s one for the statisticians.

It’s fictional because I don’t think it actually occurred. But it’s real in the sense that it’s a typical conversation, reflecting a commonly-held view about the importance of statistics and the role of statisticians in a sporting context. I want to debunk this point of view, and one of my aims in this blog is to show that statistics has a much more important role in the study of sports, above and beyond dredging through the history books to identify periods of bad United results.

In this post we’ll look at Simpson’s paradox. It’s a simple and unsettling phenomenon that arises in many different situations, and provides an illustration of why Statistics is more than just summarising data. We’ll look at two real-life examples (both taken from Wikipedia).

The first set of data come from a medical trial into the success rates of procedures for the removal of kidney stones. The study compared two available procedures, labelled A and B respectively, and analysed the results separately for both small and large kidney stones.

The success rates for a sample of patients with small kidney stones are given in the following table.

Small Stones Treatment A Treatment B
Success Rate 81/87 = 93% 234/270 = 87%

So, for example, 87 patients were given treatment A, and in 81 of these cases the treatment was deemed successful. This corresponds to a success rate of 93%. Similarly, the success rate for the 270 patients given treatment B was 87%.

For patients with a large kidney stone, the success rates using treatments A and B are summarised in the same way in the following table:

Large Stones Treatment A Treatment B
Success Rate 192/263 = 73% 55/80 = 69%

As is clear from the tables, for patients with either small or large kidney stones, treatment A has a higher success rate than treatment B, and if you were a doctor having to decide which treatment to offer to a patient, all other things being equal you’d surely choose treatment A for both types of patient.

But suppose we group all the patients together, simply adding the data from the previous tables, and then calculate the success rates with either treatment. This results in the following table (check for yourselves):

All Stones Treatment A Treatment B
Success Rate 273/350 = 78% 289/350 = 83%

Remarkably, from exactly the same data, treatment B now has a higher success rate than treatment A!
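
The arithmetic is easy to check for yourself; here’s a quick sketch using the counts from the three tables:

```python
# Successes and patient counts from the tables, by stone size.
small = {'A': (81, 87), 'B': (234, 270)}
large = {'A': (192, 263), 'B': (55, 80)}

def rate(successes, patients):
    return successes / patients

for treatment in ('A', 'B'):
    combined = (small[treatment][0] + large[treatment][0],
                small[treatment][1] + large[treatment][1])
    print(treatment,
          round(100 * rate(*small[treatment])),   # small stones
          round(100 * rate(*large[treatment])),   # large stones
          round(100 * rate(*combined)))           # all stones
# A 93 73 78  <- A wins within each group...
# B 87 69 83  <- ...yet B wins on the combined data
```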

This is Simpson’s paradox. Having just the information from the combined table, a doctor would recommend treatment B. But having the two separate tables for small and large kidney stones, a doctor would recommend treatment A for both types of patient. It seems to defy all reasonable logic.

I’ll leave you to think about (or to Google) this example for a few days. I’ll then post again with some discussion.

But first here’s another example, this time in a sporting context. In baseball the standard measure of a batter’s performance is their batting average: roughly speaking, the proportion of times they make a successful hit from an appearance at the plate. The following tables compare the batting averages of two particular batters in 1995 and 1996 respectively:

1995 Derek Jeter David Justice
Batting Average 12/48 = 25% 104/411 = 25.3%

1996 Derek Jeter David Justice
Batting Average 183/582 = 31.4% 45/140 = 32.1%

So, for example, in 1995 Derek Jeter made 48 appearances at the plate and made 12 hits, leading to a batting average of 25%. And in the same year David Justice recorded a batting average of 25.3%. Indeed, comparing the averages in both tables, David Justice recorded a higher batting average than Derek Jeter in both 1995 and 1996.

But, if we combine the data from the two tables to get the results for the entire period 1995-96 and re-calculate the averages, we get the following:

1995-96 Derek Jeter David Justice
Batting Average 195/630 = 31% 149/551 = 27%

We see Simpson’s paradox again. Derek Jeter has the higher batting average over the entire period even though David Justice had the superior average in each of the two seasons. So who was the better batter?

Like I say, I’ll leave this here for a while and discuss again later. Feel free to add something in the comments section if you’d like to discuss or ask questions.

One final thing: although I’ll save discussion of this paradox till another post, I will say that it doesn’t arise just out of chance. I mean, it’s not just a quirk of having too few data, one that would go away if we had bigger sample sizes. It’s a genuine – and rather disturbing – phenomenon, and can only be resolved by a deeper understanding of statistics than the arithmetic analysis provided above.

Footnote. Here you go Alan: at the time of writing (after 3 games) this is Man United’s worst start to a season since the 92/93 season when they lost their opening 2 games to Sheffield United and Everton, and drew their third against Ipswich. Terrible. But they did go on to win the league that season by 10 points!