No smoke without fire

No one now seriously doubts that cigarette smoking increases your risk of lung cancer and many other diseases, but when the evidence for a relationship between smoking and cancer was first presented in the 1950s, it was strongly challenged by the tobacco industry.

The history of the scientific fight to demonstrate the harmful effects of smoking is summarised in this article. One difficulty from a statistical point of view was that the primary evidence based on retrospective studies was shaky, because smokers tend to give unreliable reports on how much they smoke. Smokers with illnesses tend to overstate how much they smoke; those who are healthy tend to understate their cigarette consumption. And these two effects lead to misleading analyses of historically collected data.

An additional problem was the difficulty of establishing causal relationships from statistical associations. As with the examples in a previous post, just because there’s a correlation between smoking and cancer, it doesn’t necessarily mean that smoking is a risk factor for cancer. Indeed, one of the most prominent statisticians of the time – actually of any time – Sir Ronald Fisher, wrote various scientific articles explaining how the correlations observed between smoking and cancer rates could easily be explained by the presence of lurking variables that induce spurious correlations.
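To see what a lurking variable can do in principle, here is a minimal simulation sketch. Everything in it is hypothetical: the variables `z`, `x` and `y`, the noise levels and the seed are illustrative assumptions, and none of it is smoking data. A hidden variable `z` drives both `x` and `y`, which never influence each other directly.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_lurking(n=10_000, seed=1):
    """Hypothetical data: z drives both x and y; x and y never affect each other."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]   # the lurking variable
    x = [zi + rng.gauss(0, 1) for zi in z]    # depends on z plus its own noise
    y = [zi + rng.gauss(0, 1) for zi in z]    # depends on z plus its own noise
    return z, x, y


z, x, y = simulate_lurking()

# Raw correlation between x and y (around 0.5 in theory for this setup).
r_raw = pearson(x, y)

# Subtract the lurking variable from both and correlate what is left.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_resid = pearson(x_resid, y_resid)

print(f"correlation of x and y:           {r_raw:.2f}")
print(f"correlation after accounting for z: {r_resid:.2f}")
```

In the simulation the correlation disappears once `z` is accounted for, but only because we know what `z` is. With real data the lurking variable, if there is one, is usually unobserved, which is why this kind of argument was so hard to dismiss on correlations alone.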

At which point it’s worth noting a couple more ‘coincidences’: Fisher was a heavy smoker himself and also an advisor to the Tobacco Manufacturers Standing Committee. In other words, he wasn’t exactly neutral on the matter. But he was a highly respected scientist, and therefore his scepticism carried considerable weight.

Eventually though, the sheer weight of evidence – including that from long-term prospective studies – was simply too overwhelming to be ignored, and governments fell into line with the scientific community in accepting that smoking is a major risk factor for various types of cancer.

An important milestone in that process was the work of another British statistician, Austin Bradford Hill. As well as being involved in several of the most prominent case studies linking cancer to smoking, he also developed a set of 9 (later extended to 10) criteria for establishing a causal relationship between processes. Though only guidelines, they provide a framework that is still used today for determining whether associated processes include any causal relationships. And by these criteria, smoking was clearly shown to be a risk factor for cancer.

Now, fast-forward to today and there’s a similar debate about global warming:

  1. Is the planet genuinely heating up or is it just random variation in temperatures?
  2. If it’s heating up, is it a consequence of human activity, or just part of the natural evolution of the planet?
  3. And then what are the consequences for the various bio- and eco-systems living on it?

There are correlations all over the place – for example between CO2 emissions and average global temperatures as described in an earlier post – but could these possibly just be spurious and not indicative of any causal relationships? Certainly there are industries with vested interests who would like to shroud the arguments in doubt. Well, this nice article applies each of Bradford Hill’s criteria to various aspects of climate science data and establishes that the increases in global temperatures are undoubtedly caused by human activity leading to CO2 release into the atmosphere, and that many observable changes to biological and geographical systems are a knock-on effect of this relationship.

In summary: in the case of the planet, the smoke that we see <global warming> is definitely a consequence of the fire we started <the increased amounts of CO2 released into the atmosphere>.

Famous statisticians: Sir Francis Galton



This is the second in a so-far very short series on famous statisticians from history. You may remember that the first in the series was on John Tukey. As I said at that time, rather than just include statisticians randomly in this series, I’m going to focus on those who have had an impact beyond the realm of just statistics.

With that in mind, this post is about Sir Francis Galton (1822-1911), an English statistician who did most of his work in the second half of the 19th century, around the time that Statistics was being born as a viable scientific discipline.

You may remember seeing Galton’s name recently. In a recent post on the bean machine, I mentioned that the device also goes under the name of ‘Galton board’. This is because Galton was the inventor of the machine, which he used to illustrate the Central Limit Theorem, as discussed in the earlier post. You may also remember an earlier post in which I discussed ‘regression to the mean’; Galton was also the first person to explore and describe this phenomenon, as well as the more general concept of correlation to describe the extent to which two random phenomena are connected.
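For anyone who missed the bean machine post, here is a rough simulation sketch of the idea. It makes simplifying assumptions that a real board only approximates: each ball bounces left or right with equal probability at every row of pins, independently of the other rows. The number of balls, the number of rows and the seed are arbitrary illustrative choices.

```python
import random
import statistics
from collections import Counter


def drop_balls(n_balls=5000, n_rows=12, seed=0):
    """Return the final bin of each ball: the number of rightward bounces."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < 0.5 for _ in range(n_rows))
        for _ in range(n_balls)
    ]


n_rows = 12
bins = drop_balls(n_rows=n_rows)
counts = Counter(bins)

# Crude text histogram: one '#' per 50 balls in each bin.
for k in range(n_rows + 1):
    print(f"{k:2d} | " + "#" * (counts[k] // 50))

# Theory for this idealised board: mean n_rows / 2, variance n_rows / 4.
mean_bin = statistics.fmean(bins)
var_bin = statistics.pvariance(bins)
print(f"mean bin: {mean_bin:.2f}, variance: {var_bin:.2f}")
```

Each ball’s final position is a sum of many small independent left-or-right steps, so the pile of balls takes on the familiar bell shape, which is the Central Limit Theorem at work.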

It’s probably no coincidence that Galton was a half-cousin of Charles Darwin, since much of Galton’s pioneering work was on the way statistics could be used to understand genetic inheritance and human evolution. Indeed, he is the inventor of the term eugenics, which he coined during his attempts to understand the extent to which intelligence is inherited, rather than developed.

Galton is described in Wikipedia as:

  • A statistician
  • A progressive
  • A polymath
  • A sociologist
  • A psychologist
  • An anthropologist
  • A eugenicist
  • A tropical explorer
  • A geographer
  • An inventor
  • A meteorologist
  • A proto-geneticist
  • A psychometrician

And you thought you were busy. Anyway, it’s fair to say that Galton falls in my category of statisticians who have done something interesting with their lives outside of Statistics.

His various contributions apart from those mentioned above include:

  1. He pioneered the use of weather maps for the general public;
  2. He wrote a book ‘The Art of Travel’ which offered practical travel advice to Victorians;
  3. He was the first to propose the use of questionnaires as a means of data collection;
  4. He conceived the notion of standard deviation as a way of summarising the variation in data;
  5. He devised a technique called composite portraiture, an early version of Photoshop for making montages of photographic portraits;
  6. He pretty much invented the technique of identifying individuals by their fingerprints.

In summary, many of the things Galton worked on or invented are still relevant today. And this is just as true for his non-statistical contributions as for his statistical ones. Of course, it’s an unfortunate historical footnote that his theory of eugenics – social engineering to improve biological characteristics in populations – was adopted and pushed to extremes in Nazi Germany, with unthinkable consequences.

In retrospect, it’s a pity he didn’t just stop once he’d invented the bean machine.


Famous statisticians: John Tukey


This is the first in a (possibly short) series of posts giving biographical details of famous statisticians from history. There are many possibilities here, but I’ll generally limit posts to statisticians who’ve actually done something really interesting, especially outside of the traditional world of statistics.

With that in mind, the first entry in the series is John Tukey. Tukey was an American statistician, born in 1915, who died in 2000. He was really at the forefront of modern statistics, moving the subject on from very traditional topics like hypothesis testing and decision theory, to a more comprehensive, exploratory version of the discipline, relying heavily on techniques that were only made possible by 20th century advances in computing. You may have heard of a boxplot, as a method for summarising data measured on a single variable, or more usefully for comparing data on several variables. Well, the boxplot was invented – along with many other useful techniques in the statistician’s toolbox – by Tukey.
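If you haven’t met a boxplot before, the numbers behind one can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a definitive recipe: the data are made up, the quartiles use one of several conventions in circulation (Python’s ‘inclusive’ method, which can differ slightly from the hinges Tukey himself used), and points more than 1.5 times the interquartile range beyond the quartiles are flagged as outliers, which is the usual convention.

```python
import statistics


def boxplot_summary(data):
    """Return the ingredients of a boxplot for a list of numbers."""
    values = sorted(data)
    # Quartiles under the 'inclusive' convention (an assumption; others exist).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers reach to the most extreme points still inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Anything beyond the fences is drawn as an individual point.
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }


# Made-up data with one obviously unusual value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
summary = boxplot_summary(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The box spans the two quartiles with a line at the median, the whiskers show the range of the ordinary points, and the unusual value is shown separately, so a single picture conveys location, spread and oddities at once.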

Outside of a strictly statistical context, Tukey coined the computing term ‘bit’ (binary digit). He’s also credited with having been the first person to use the term ‘software’ to describe a computer program, though the same term had previously been used to describe personnel.

But my reason for introducing Tukey as the first in this series is that some of my favourite quotes in statistics are his. A fairly comprehensive list can be found here. I like this one for example:

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.

Clients please take note. But my absolute favourite Tukey quote, and my reason for making him the first in this series of famous statisticians, is this:

The best thing about being a statistician is that you get to play in everyone’s backyard.

What this means is that statisticians are needed everywhere – by scientists, in government, by doctors, in industry, in finance and indeed by gamblers – and with a standard toolkit of statistical techniques statisticians get to work in many different fields; in other words, in everyone’s backyard. The corollary to this is that techniques developed for use in one backyard are often just as useful (and sometimes even more useful) in a completely different backyard. In particular, none of the techniques we use for our sports models were originally developed with sports in mind: all of them came from someone else’s backyard.

So, I make no apology for the diverse nature of this blog. The common thread is statistics, but I’ll aim to cover as many backyards as possible.