# Cube-shaped poo

Do you like pizza? If so, I’ve got good and bad news for you.

The good news is that the 2019 Ig Nobel prize winner in the category of medicine is Silvano Gallus, who received the award for…

… collecting evidence that pizza might protect against illness and death…

The bad news, for most of you, is that this applies…

…if the pizza is made and eaten in Italy.

Obviously, it’s a bit surprising that pizza can be considered a health food. But if you accept that, it’s also a bit surprising that it has to be Italian pizza. So, what’s going on?

The Ig Nobel prizes are a satirical version of the Nobel prizes. Here’s the Wikipedia description:

The Ig Nobel Prize (/ˌɪɡnoʊˈbɛl/ IG-noh-BEL) is a satiric prize awarded annually since 1991 to celebrate ten unusual or trivial achievements in scientific research, its stated aim being to “honor achievements that first make people laugh, and then make them think.” The name of the award is a pun on the Nobel Prize, which it parodies, and the word ignoble.

As such, the prize is awarded for genuine scientific research, but for areas of research that are largely incidental to human progress and understanding of the universe. For example, this year’s prize in the field of physics went to a group of scientists for…

It’s in this context that Silvano Gallus won his award. But although the Ig Nobel award says something about the irrelevance of the subject matter, it’s not intended as a criticism of the quality of the underlying research. Gallus’s work with various co-authors (all Italian) was published as an academic paper, ‘Does Pizza Protect Against Cancer?’, in the International Journal of Cancer. This wouldn’t have happened if the work didn’t have scientific merit.

Despite this, there are reasons to be cautious about the conclusions of the study. The research is based on a type of statistical experimental design known as a case-control study. This works as follows. Suppose, for argument’s sake, you’re interested in testing the effect of pizzas on the prevention of certain types of disease. You first identify a group of patients having the disease and ask them about their pizza-eating habits. You then also find a group of people who don’t have the disease and ask them about their pizza-eating habits. You then check whether the pizza habits are different in the two groups.

Actually, it’s a little more complicated than that. It might be that age or gender or something else is also different in the two groups, so you also need to correct for these effects as well. But the principle is essentially just to see whether the tendency to eat pizza is greater in the control group – if so, you conclude that pizza is beneficial for the prevention of the specified disease. And on this basis, for a number of different cancer-types, Silvano Gallus and his co-authors found the proportion of people eating pizzas occasionally or regularly to be higher in the control group than in the case group.
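To make the case-control comparison concrete, here’s a minimal sketch in Python. The counts are entirely invented for illustration – they are not the figures from Gallus’s paper – but the calculation is the standard one for this kind of study:

```python
# Hypothetical case-control counts (invented for illustration, not
# the figures from Gallus's paper).
cases_eat, cases_no = 60, 140        # pizza eaters / non-eaters among cases
controls_eat, controls_no = 90, 110  # pizza eaters / non-eaters among controls

# Proportion of pizza eaters in each group
p_cases = cases_eat / (cases_eat + cases_no)
p_controls = controls_eat / (controls_eat + controls_no)

# The standard case-control summary is the odds ratio: values below 1
# suggest the exposure (pizza) is associated with lower disease risk.
odds_ratio = (cases_eat / cases_no) / (controls_eat / controls_no)

print(f"pizza eaters among cases:    {p_cases:.0%}")
print(f"pizza eaters among controls: {p_controls:.0%}")
print(f"odds ratio: {odds_ratio:.2f}")
```

With these made-up numbers the pizza-eating rate is higher in the control group, and the odds ratio comes out below 1 – the pattern Gallus and his co-authors reported.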

Case-control studies are widely used in medical and epidemiological studies because they are quick and easy to implement. The more rigorous ‘randomised control study’ would work as follows:

1. You recruit a number of people for the study, none of whom have the disease of interest;
2. You randomise them into two groups. One of the groups will be required to eat pizza on a regular basis; the other will not be allowed to eat pizza;
3. You follow the 2 groups over a number of years and identify whether the rate of disease turns out to be lower in the pizza-eating group than in the non-pizza-eating group;
4. Again, you may want to correct for other differences in the 2 groups (though the need for this is largely eliminated by the randomisation process).

Clearly, for both logistical and time reasons, a randomised control study is completely unrealistic for studying the effects of pizza on disease prevention. However, in terms of reliability of results, case-control studies are generally inferior to randomised control studies because of the potential for bias.

In case-control studies the selection of the control group is extremely important, and it might be very easy to fall into the trap of inadvertently selecting people with an unusually high rate of eating pizzas (if, for example, you surveyed people standing outside a pizzeria). It’s also easy – by accident or design – for the researcher to get the answer they might want when asking a question. For example: “you eat a lot of pizza, don’t you?” might get a different response from “would you describe yourself as a regular pizza eater?”. Moreover, people simply might not have an accurate recollection of their long-term eating habits. But most importantly, you are asking people with, for example, cancer of the colon whether they are regular pizza eaters. Quite plausibly this type of disease has a big effect on diet, and one can well imagine that pizzas are not advised by doctors. So although the pizza-eating question is probably intended to relate to the period prior to getting the disease, it’s possible that people with the disease are no longer tending to eat pizza, and respond accordingly.

Finally, even if biases are eliminated by careful execution of the study, there’s the possibility that the result is anyway misleading. It may be that although pizzas seem to give disease protection, it’s not the pizza itself that’s providing the protection, but something else that is associated with pizza eating. For example, regular pizza eating might just be an indicator of someone who simply has regular meals, which may be the genuine source of disease protection. There’s also the possibility that while the rates of pizza eating are lower among the individuals with the specified diseases, they are much higher among individuals with other diseases (heart problems, for example). This could have been identified in a randomised control study, but flies completely under the radar in a case-control study.

So, case-control studies are a bit of a minefield, with various potential sources of misleading results, and I would remain cautious about the life-saving effects of eating pizza.

And finally… like all statistical analysis, any conclusions made on the basis of sample results are only relevant to the wider population from which that sample was drawn. And since this study was based on Italians eating Italian pizzas, the authors conclude…

Extension of the apparently favorable effect of pizza on cancer risk in Italy to other types of diets and populations is therefore not warranted.

So, fill your boots at Domino’s Pizzas, but don’t rely on the fact that this will do much in the way of disease prevention.

# The infinite monkey theorem

Stick a monkey on a typewriter, let him hit keys all day, and what will you get? Gibberish, probably. But what if you’re prepared to wait longer than a day? Much longer than a day. Infinitely long, say. In that case, the monkey will produce the complete works of Shakespeare. And indeed any and every other work of literature that’s ever been written.

This is from Wikipedia:

The infinite monkey theorem states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type any given text, such as the complete works of William Shakespeare.

Infinity is a tricky but important concept in mathematics generally. We saw the appearance of infinity in a recent post, where we looked at the infinite sequence of numbers

1, 1/2, 1/4, 1/8,….

and asked what their sum would be. And it turned out to be 2. In practice, you can never really add infinitely many numbers, but you can add more and more terms in the sequence, and the more you add the closer you will get to 2. Moreover, you can get as close to 2 as you like by adding sufficiently many terms in the sequence. It’s in this sense that the sum of the infinite sequence is 2.
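This limiting behaviour is easy to see numerically. Here’s a quick sketch in Python (the sums here are exact, since powers of 1/2 are represented exactly in binary):

```python
# Partial sums of 1 + 1/2 + 1/4 + ... : the more terms we add,
# the closer we get to 2, without ever quite reaching it.
partial = 0.0
for k in range(20):
    partial += (1 / 2) ** k

print(partial)       # within a couple of millionths of 2
print(2 - partial)   # the remaining gap is exactly (1/2)**19
```

Adding more terms shrinks the gap further: after 50 terms it is about 10⁻¹⁵, and so on without limit.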

In Statistics the concept of infinity and infinite sums is equally important, as we’ll discuss in a future post. But meantime… the infinite monkey theorem. What this basically says is that if something can happen in an experiment, and you repeat that experiment often enough, then eventually it will happen.

Sort of. There’s still a possibility that it won’t – the monkey could, by chance, just keep hitting the letter ‘a’ forever, for example – but that possibility has zero probability. That’s the ‘almost surely’ bit in the Wikipedia definition. On the other hand, with probability 1 – which is to say complete certainty – the monkey will eventually produce the complete works of Shakespeare.

Let’s look at the calculations, which are very similar to those in another recent post.

There are roughly 50 keys on a keyboard, so assuming the monkey is hitting keys at random, the probability that the first key stroke matches the first letter of Shakespeare’s works is 1/50. Similarly, the probability that the second letter matches is also 1/50. So the probability that the first two letters match is

$1/50 \times 1/50$

Our monkey keeps hitting keys and at each new key stroke, the probability that the match-up continues is multiplied by 1/50. This probability gets small very, very quickly. But it never gets to zero.

Now, if the monkey has to hit N keys to have produced a text as long as the works of Shakespeare, by this argument he’ll get a perfect match with probability

$p=(1/50)^N$

This will be a phenomenally small number. Virtually zero. But, crucially, not zero. And if our tireless monkey repeats that exercise a large number of times, let’s say M times, then the probability he’ll produce Shakespeare’s works at least once is

$Q = 1-(1-p)^M$

And since p is bigger than zero – albeit only slightly bigger than zero – Q gets bigger with M. And just as the sum of the numbers 1, 1/2, 1/4, … gets closer and closer to 2 as the number of terms increases, so Q can be made as close to 1 as we like by choosing M large enough.

Loosely speaking, when M is infinity, the probability is 1. And even more loosely: given an infinite amount of time our monkey is bound to produce the complete works of Shakespeare.
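The whole calculation can be checked on a scaled-down version of the problem: keep the 50 keys, but shrink the ‘text’ to just N = 3 characters, so that p is small but not astronomically so. The sketch below follows the formulas above exactly:

```python
# Scaled-down monkey: 50 keys, a 'text' of just N = 3 characters,
# so p = (1/50)**3 rather than something astronomically small.
N = 3
p = (1 / 50) ** N   # probability a single attempt matches the whole text

Qs = []
for M in [10_000, 1_000_000, 100_000_000]:
    Q = 1 - (1 - p) ** M   # probability of at least one match in M attempts
    Qs.append(Q)
    print(f"M = {M:>11,}: Q = {Q:.6f}")
```

However tiny p is, Q climbs towards 1 as M grows – which is the finite-M shadow of the ‘almost surely’ in the theorem.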

Obviously, both the monkey and the works of Shakespeare are just metaphors, and the idea has been expressed in many different forms in popular culture.  Here’s Eminem’s take on it, for example:

# The China syndrome

In a couple of earlier posts I’ve mentioned how statistical analyses have sometimes been used to demonstrate that results in published analyses are ‘too good to be true’. One of these cases concerned Mendel’s laws of genetic inheritance. Though the laws have subsequently been shown to be unquestionably true, Mendel’s results on pea experiments were insufficiently random to be credible. The evidence strongly suggests that Mendel tweaked his results to fit the laws he believed to be true. He just didn’t understand enough about statistics to realise that the very laws he wanted to establish also implied sizeable random variation around predicted results, and the values he reported were much too close to the predicted values to be plausible.

As discussed in a recent academic article, a similar issue has been discovered in respect of official Chinese figures for organ donation. China has recently come under increasing international pressure to discontinue its practice of using organs of dead prisoners for transplants. One issue was consent – did prisoners consent to the use of their organs before their death? But a more serious issue was with respect to possible corruption and even the possibility that some prisoners were executed specifically to make their organs available.

Anyway, since 2010 China has made efforts to discontinue this practice, replacing it with a national system of voluntary organ donation. Moreover, they announced that from 2015 onwards only hospital-based voluntary organ donations would be used for transplants. And as evidence of the success of this program, two widely available datasets published respectively by the China Organ Transplant Response System (COTRS) and the Red Cross Society of China, show rapid growth in the numbers of voluntary organ donations, which would more than compensate for the cessation of the practice of donations from prisoners.

Some of the yearly data counts from the COTRS database are shown in this figure taken from the report referenced above. The actual data are shown by points (triangles and crosses); the curves have been added to show the general trend in the observed data. Clearly, for each of the count types, one can observe a rapid growth in the number of donations.

But… here’s the thing… look at how closely the smooth curves approximate the data values. The fit is almost perfect for each of the curves. And there’s a similar phenomenon for other data, including the Red Cross data. But when similar relationships are looked at for data from other countries, something different happens: the trend is generally upwards, as in this figure, but the data are much more variable around the trend curve.

In summary, it seems much more likely that the curves were chosen first, and the data then constructed to fit them closely. But just like Mendel’s pea data, this has been done without a proper awareness that nature is bound to produce substantial variation around any underlying law. However, unlike Mendel, who presumably just invented numbers as a shortcut to establishing a law that was true, the suspicion remains that neither the data nor the law are valid in the case of the Chinese organ donation numbers.

A small technical point for those of you that might be interested in such things. The quadratic curves in the above plot were fitted in the report by the method of simple least squares, which finds the quadratic curve minimising the sum of squared distances between the points and the curve. As a point of principle, I’d argue this is not very sensible. When the counts are bigger, one would expect more variation, so we should downweight the contribution of the large counts and increase that of the smaller ones. In other words, we’d expect the curve to fit better in the early years and worse in the later years, and we should take that into account when fitting the curve. In practice, the variations around the curves are so small that the results obtained either way are likely to be almost identical. So, it’s a point of principle more than anything else. But still, in an academic paper which purports to use the best available statistics to discredit a claim made by a national government, it would probably be best to make sure you really are using the most appropriate statistical methods for the analysis.
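For what it’s worth, here’s a sketch of the difference between the two fitting approaches, using numpy and entirely synthetic yearly counts (the real COTRS data aren’t reproduced here). The weights 1/√count reflect the Poisson-like assumption that the variance of a count grows with its mean:

```python
import numpy as np

# Synthetic yearly counts (invented for illustration) following a rough
# quadratic trend, with noise that grows with the size of the count.
rng = np.random.default_rng(0)
years = np.arange(2010, 2019)
t = years - years[0]
true = 50 + 30 * t + 10 * t**2
counts = true + rng.normal(scale=0.05 * true)  # bigger counts, bigger noise

# Ordinary least squares: every point weighted equally.
ols = np.polyfit(t, counts, 2)

# Weighted least squares: downweight the noisier large counts. With
# Poisson-like counts the variance scales with the mean, so weights
# proportional to 1/sqrt(count) are a common choice (np.polyfit
# applies w directly to the residuals).
wls = np.polyfit(t, counts, 2, w=1 / np.sqrt(counts))

print("OLS coefficients:", np.round(ols, 1))
print("WLS coefficients:", np.round(wls, 1))
```

As the post says, when the scatter around the curve is tiny the two fits barely differ; the choice matters more as the heteroscedasticity grows.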

# Black Friday

Had you heard that today is Black Friday? Or have you been living as a hermit in a cave, without phone or access to emails, for the last couple of weeks or so?

Like Cyber Monday, Green Monday and Giving Tuesday, Black Friday is a retail event imported from the United States, where it is timed to coincide with the Thanksgiving national holiday period. Sadly, here in the UK, we don’t get the holiday, but we do get the pleasure of a day – which often extends to at least a couple of weeks – indulging ourselves with the luxury of purchasing goods that we probably don’t need at prices that are well below the usual retail price.

Or do we?

The consumer group Which? monitored the prices of 83 products that were offered for sale during 2018’s Black Friday event and found:

• 95% of the products were available at the same price or cheaper in the 6 months following Black Friday;
• 61% of the products had been available at the same price or cheaper in the 6 months prior to Black Friday;
• Just 5% of the products were genuinely at their cheapest on Black Friday compared to the surrounding 12-month period.

Obviously 83 products is not a huge sample size, especially since different retailers are likely to have a different pricing strategy, so you shouldn’t read too much into the exact numbers. But the message is clear and probably ties in with your own experience of the way retailers manipulate shoppers’ expectations during ‘sales’.

Anyway, a fun statistical analysis of various aspects of Black Friday can be found here. I’m not sure how reliable any of the analyses are, especially in light of the Which results, but an example is given in the following figure. This shows – apparently – the sales growth per country on Black Friday compared to a regular Friday.

Now, I don’t know if it’s the number of items sold, the money spent, or something else, but in any case Pakistan supposedly has a retail rate that’s 11525% of a normal Friday rate. That’s to say a sales increase factor of 115. In Italy the factor is 45 and even in the UK the usual Friday rate is multiplied by 15. Impressive if true.

But I’m personally more impressed by Thailand who doggedly spend less than half of a normal Friday’s expenditure on Black Friday. Of course, we can’t tell from these data whether this is due to a genuine resistance to Black Friday, or whether Thailand has a strong seasonal variation in sales such that this time of the year is naturally a period of low sales.

Finally, if you want to empathise with Thailand, you could yourself participate in Buy Nothing Day, intentionally held on the same day as Black Friday. It probably doesn’t need much in the way of explanation, but just in case, here’s the tagline from the webpage:

## It’s time to celebrate Buy Nothing Day!

Maybe someone should pass the message on to Pakistan.

# At The Intersection

You’ll remember Venn diagrams from school. They’re essentially a mathematical tool for laying out the information in partially overlapping sets. And in statistics they are often used in the same way for showing the possible outcomes in events which might overlap.

For example, here’s a Venn diagram showing the relationship between whales and fish:

Whales and fish have some properties that are unique, but they also have some features in common. These are all shown in the appropriate parts of the diagram, with the common elements falling in the part of the sets that overlap – the so-called intersection.

With this in mind, I recently came across the following Venn poem titled ‘At the Intersection’ written by Brian Bilston:

You can probably work it out. There are three poems in total:  separate ones for ‘him’ and ‘her’ and their intersection. Life seen from two different perspectives, the result of which is contained in the intersection.

Genius.

# One-in-a-million

Suppose you can play on either of 2 slot machines:

1. Slot machine A pays out with probability one in a million.
2. Slot machine B pays out with probability one in 10.

Are you more likely to get a payout with one million attempts with slot machine A or with 10 attempts on slot machine B?


So, there’s a bigger probability (0.65) that you’ll get a payout from 10 spins of slot machine B than from a million spins of slot machine A (probability 0.63).

Hopefully, the calculations above are self-explanatory. But just in case, here’s the detail. Suppose you have N attempts to win with a slot machine that pays out with probability 1/N.

1. First we’ll calculate the probability of zero payouts in the N spins.

2. This means we get a zero payout on every spin.

3. The probability of a zero payout on one spin is one minus the probability of a win: 1 – 1/N.

4. So the probability of no payout on all the spins is

$(1-1/N)^N$

5. And the probability of at least one payout is

$1- (1-1/N)^N$

As explained in the tweet, with N=10 this gives 0.65 and with N=1,000,000 it gives 0.63. The tweet’s author explains in a follow-up tweet that he was expecting the same answer both ways.

But as someone in the discussion pointed out, that logic can’t be right. Suppose you had one attempt with slot machine C which paid out with probability 1. In other words, N=1 in my example above. Then, of course, you’d be bound to get a payout, so the probability of at least one payout is 1. So, although it’s initially perhaps surprising that you’re more likely to get a payout with 10 shots at slot machine B than with a million shots at slot machine A, the dependence on N becomes obvious when you look at the extreme case of slot machine C.
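A minimal sketch of the calculation, covering slot machines A and B as well as the extreme case of machine C:

```python
# Probability of at least one payout in N attempts at a machine
# that pays out with probability 1/N on each attempt.
def prob_at_least_one(N):
    return 1 - (1 - 1 / N) ** N

print(prob_at_least_one(10))         # slot machine B: 10 spins at 1-in-10
print(prob_at_least_one(1_000_000))  # slot machine A: a million spins
print(prob_at_least_one(1))          # slot machine C: a certain payout
```

As N grows, 1 − (1 − 1/N)^N settles towards 1 − 1/e ≈ 0.632, which is why the million-spin answer comes out near 0.63 rather than creeping back up to 0.65.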

Footnote: What does stay the same in each case however is the average number of times you will win. With N shots at a slot machine with win probability 1/N, you will win on average once for any choice of N. Sometimes you’ll win more often, and sometimes you may not win at all (except when N=1). But the average number of wins if you play many times will always be 1.
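A quick simulation illustrates the footnote (the number of trials is arbitrary; more trials just pins the average down more tightly):

```python
import random

# Simulate N attempts at a machine paying out with probability 1/N,
# repeated over many trials, and report the average number of wins.
def average_wins(N, trials=10_000, seed=1):
    rng = random.Random(seed)
    total = sum(
        sum(1 for _ in range(N) if rng.random() < 1 / N)
        for _ in range(trials)
    )
    return total / trials

print(average_wins(10))   # close to 1
print(average_wins(100))  # also close to 1
```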

# Juvenile dinosaurs

This blog is mostly about Statistics as a science rather than statistics as numbers. But just occasionally the statistics themselves are so shocking, they’re worthy of a mention.

With this in mind I was struck by two statistics of a similar theme in the following tweet from Ben Goldacre (author of the Bad Science website and book):

Moreover, in the discussion following Ben’s tweet, someone linked to the following cartoon figure:

This shows that even if you change the way of measuring distance from time to either phylogenetic distance or physical similarity, the following holds: the distance between a sparrow and T-Rex is smaller than that between T-Rex and Stegosaurus.

Footnote 1: this is more than a joke. Recent research makes the case that there is a strong evolutionary link between birds and dinosaurs. As one of the authors writes:

We now understand the relationship between birds and dinosaurs that much better, and we can say that, when we look at birds, we are actually looking at juvenile dinosaurs.

Footnote 2. Continuing the series (also taken from the discussion of Ben’s tweet)… Cleopatra is closer in time to the construction of the space shuttle than the pyramids.

Footnote 3. Ben Goldacre’s book, Bad Science, is a great read. It includes many examples of the way science – and Statistics – can be misused.

# Problem solved

A while back I set a puzzle asking you to try to remove three coins from a red square region as shown in the following diagram.

The only rule of the game is that when a coin is removed it is replaced with two coins: one immediately to the right of, and one immediately below, the coin that is removed. If there is no space for adding these replacement coins, the coin cannot be removed.

The puzzle actually appeared in a recent edition of Alex Bellos’ Guardian mathematics puzzles, though it was created by the Argentinian mathematician Carlos Sarraute. This is his solution, which is breathtaking in its ingenuity.

The solution starts by giving a value to every square in the grid as follows:

Remember, the grid goes on forever both to the right and downwards. The top left-hand box has value 1. Going right from there, every subsequent square has value equal to 1/2 of the previous one. So: 1, 1/2, 1/4, 1/8 and so on. The first column is identical to the first row. To complete the second row, we start with its first value, 1/2, and again just keep multiplying by 1/2. The second column is the same as the second row. And we fill the entire grid this same way. Technically, every row and column is a geometric series: each term is the previous one multiplied by a common ratio, which in this case is 1/2.

Let’s define the value of a coin to be the value of the square it’s on. Then the total value of the coins at the start of the game is

$1 + \frac{1}{2} + \frac{1}{2}= 2$

Now…

• When we remove a coin we replace it with two coins, one immediately below and one immediately to the right. But if you look at the value of any square on the grid, it is equal to the sum of the values of the squares immediately below and to the right. So when we remove a coin we replace it with two coins whose total value is the same. It follows that the total value of the coins stays unchanged, at 2, however many moves we make.
• This is the only tricky mathematical part. Look at the first row of numbers. It consists of 1, 1/2, 1/4, 1/8… and goes on forever. But even though this is an infinite series, it has a finite sum of 2. Obviously, we can never really add infinitely many numbers in practice, but by adding more and more terms in the series we will get closer and closer to the value of 2. Try it on a calculator. In summary:

$1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} +\ldots = 2.$

• Working down the rows, the second row is the same as the first with the first term removed. So its sum must be 1. The third is the same as the second with the first term of 1/2 removed, so its sum is 1/2. By the same reasoning, the sum of the fourth row will be 1/4, the fifth row 1/8 and so on.
• So, the row sums are respectively 2, 1, 1/2, 1/4, …. This is the same as the first row of values but with an additional first term of 2. It follows that the sum of the row sums, and therefore the sum of all numbers in the grid, is 2+2=4. Again, we can’t add all the numbers in practice, but we will get closer and closer to the value of 4 by adding more and more squares.
• The total value of the squares inside the red square is 1 + 1/2 + 1/2 + 1/4 = 9/4. Since the total value of all squares in the grid is 4, the total value outside this region must be 4 − 9/4 = 7/4.
• Putting all this together, the initial value of the coins was 2. After any number of moves, the total value of all coins will always remain 2. But the total value of all squares outside the red square is only 7/4. It must therefore be impossible to remove the three coins from the red square because to do so would require the coins outside of this area to have a value of 2, which is greater than the total value available in the entire region.
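For anyone who’d like to check the arithmetic, the invariant and the various sums can be verified numerically. The sketch below uses exact fractions to avoid any rounding, and approximates the infinite grid by a large finite chunk:

```python
from fractions import Fraction

# Value of the square in row i, column j (0-indexed): (1/2)**(i + j).
def value(i, j):
    return Fraction(1, 2 ** (i + j))

# The key invariant: each square's value equals the sum of the values
# of the squares immediately below and immediately to the right.
assert value(3, 5) == value(4, 5) + value(3, 6)

# Total value of the four squares in the 2x2 red region: 9/4.
red = sum(value(i, j) for i in range(2) for j in range(2))
print(red)  # 9/4

# A large finite chunk of the grid sums to almost exactly 4, so the
# value outside the red square approaches 4 - 9/4 = 7/4.
chunk = sum(value(i, j) for i in range(60) for j in range(60))
print(float(chunk))        # close to 4
print(float(chunk - red))  # close to 7/4 = 1.75
```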

I find this argument quite brilliant. My instincts were to try to solve the puzzle using arguments from geometry. I failed. It would never have occurred to me to try to develop a solution based on the properties of numbers.

As I wrote in the original post, this puzzle doesn’t really have any direct relevance to Statistics, except insofar as it shows the power and beauty of mathematical proof, which is an essential part of statistical theory. Having said that, the idea of infinite limits is important in Statistics, and I’ll discuss this in a further post. Let me just mention, though, that summing infinite series as in the solution above is a delicate issue for at least two reasons:

1. Although the series 1 + 1/2 + 1/4 + 1/8 + … has a finite sum of 2, the series 1 + 1/2 + 1/3 + 1/4 + 1/5 + … (the harmonic series) has no finite sum. The sum grows very slowly, but as I take more and more numbers in the series, the sum grows without any limit. That’s to say, if you give me any number – say 1 million – I can always find enough terms in the series for the sum to be greater than that number.
2. To get the total value of the grid, we first added the rows and then added these row sums across the columns. We could alternatively have first added the columns, and then added these columns sums across the rows and we’d have got the same answer. For this example both these alternatives are valid. But in general this interchange of row and column sums to get the total sum is not valid. Consider, for example, this infinite grid:

The first row sums to 2, after which all other rows sum to zero. So, the sum of the row sums is 2. But every column sums to zero, so if we sum the columns and then sum these sums we get 0. This couldn’t possibly happen with finite grids, but infinite grids require a lot more care.
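Point 1 above – the divergence of the harmonic series – is easy to verify numerically, and it also shows just how slowly the sum grows:

```python
# Partial sums of the harmonic series 1 + 1/2 + 1/3 + ... eventually
# exceed any target, but they get there remarkably slowly.
total = 0.0
n = 0
terms_needed = {}
for target in [2, 5, 10]:
    while total <= target:
        n += 1
        total += 1 / n
    terms_needed[target] = n
    print(f"sum first exceeds {target} after {n:,} terms")
```

Exceeding 100 this way would take more terms than any computer could add in a lifetime, yet the sum still passes 100 eventually.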

In a follow-up post we’ll consider limits of sums in the context of Statistics.

Finally, I’m grateful to Fabian.Thut@smartodds.co.uk for some follow-up discussion on the original post. In particular, we ended up discussing the following variation on the original puzzles. The rules are exactly the same as before, but the starting configuration of the coins is now as per the following diagram:

In this case, can the puzzle be solved? Does the argument presented for the original problem help in any way?

If you have any thoughts about this, please do write to me. In any case, I’ll write another post with the solution to this version shortly.

# Coincidentally

Here we go again…

Happy birthday to me. And Harry.Hill@smartodds.co.uk and Rickie.Reynolds@smartodds.co.uk. And willfletcher1111@gmail.com who also used to be in the quant team. What a remarkable coincidence that 3 of us currently in the quant team – together with another quant who has since left – each have our birthday on the 11th November. But as I discussed in a post around this time last year, as well as at a previous offsite, there are so many possible combinations of three or four people in the company that could have a shared birthday that it’s not very surprising that one combination does. It just happened to be me, Harry, Rickie and Will on 11/11.
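Just for fun, here’s a rough simulation of how likely a three-way birthday match is. The company size below is invented for illustration (and leap years are ignored):

```python
import random
from collections import Counter

# Estimate the probability that, among `people` individuals with
# uniformly random birthdays, at least 3 share the same day.
def prob_triple_birthday(people=40, sims=20_000, seed=2):
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        counts = Counter(rng.randrange(365) for _ in range(people))
        if max(counts.values()) >= 3:
            hits += 1
    return hits / sims

print(prob_triple_birthday())
```

For any one trio the chance is tiny, but across a whole company of people the chance of some three-way match is far from negligible – and it grows quickly with the headcount.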

And on the subject of coincidences…

You may have heard of an app called what3words. This is a location app for iOS or Android which divides the entire globe into 3 x 3 metre squares and assigns 3 words to each square. For example, currently sitting at my desk in the Smartodds office, the 3 allocated words are “insert”, “falls”, “opens”. The idea is that in an emergency I can identify and communicate my unique position to the relevant emergency services by means of just these 3 words. Of course, I could do the same thing with my GPS coordinates, but the point is that standard words are easier to read and communicate in a hurry. And there are already a number of instances in which lives have potentially been saved through use of the app.

And the coincidence? Well, I opened the app in my house the other day with this result:

… and, here in a recent Halloween pic with my Grandson, is my Granddaughter…

… the charmingly awesome Poppy!

Footnote: writing this now, I’m reminded that some time ago, as a follow-up to a post  which also discussed coincidences, Richard.Greene@smartodds.co.uk mailed me about an experience he’d recently had. He described it as follows:

I was listening to the radio one morning, and the presenter mentioned “French windows”. I wasn’t sure at the time what they were, and remember amusing myself as to what made them French exactly – perhaps they come with a beret on top etc…anyway, an hour later, I was watching Frasier over my cornflakes and there was a joke/reference to French windows!

Like the shared birthdays, if you tried to calculate the chance of the mentioning of French windows on both the radio and an episode of Frasier within a short time of one another, the probability would be incredibly remote. But again, we experience so many opportunities for coincidences every day, that although the vast majority don’t happen, one or two inevitably do. And they’re the ones we remember and sometimes ascribe to ‘fate’, ‘destiny’, ‘karma’ etc etc. When in fact it’s just the laws of probability playing out in our daily lives.

Anyway, Richard suggested that an idea for a blog post would be to collect and collate ‘coincidences’ of the kind I’ve described here – my experience with what3words; Richard’s with French windows. So, if you’ve recently had, or have in the near future, a coincidental experience of some sort, please send it to me and I’ll include it in a future post.

# Size does matter

Consider the following scenario…

A football expert claims that penalties are converted, on average, with a 65% success rate. I collect a sample of games with penalties and find that the conversion rate of penalties in that sample is 70%. I know that samples are bound to lead to some variation in sample results, so I’m not surprised that I didn’t get a success rate of exactly 65% in my sample. But, is the difference between 65% and 70% big enough for me to conclude that the expert has got it wrong? And would my conclusions be any different if the success rate in my sample had been 80%? Or 90%?

This type of issue is at the heart of pretty much any statistical investigation: judging the reliability of an estimate provided by a sample, and assessing whether its evidence supports or contradicts some given hypothesis.

Actually, it turns out that with just the information provided above, the question is impossible to answer. For sure, a true success rate of 65% is more plausible with a sample value of 70% than it would have been with a sample value of 80% or 90%. And just as surely, having got a sample value of 70%, a true value of 65% is more plausible than a true value of 60%. But, the question of whether the sample value of 70% actually supports or opposes the claim of a true value of 65% is open.

To answer this question we need to know whether a sample value of 70% is plausible or not if the true value is 65%. If it is, we can’t say much more: we’d have no reason to doubt the 65% value, although we still couldn’t be sure – we can never be sure! – that this value is correct. But if the sample value of 70% is not plausible if the population value is 65%, then this claim about the population is likely to be false.
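This plausibility check can be made concrete with a binomial model. A minimal sketch, assuming (purely for illustration, since the post doesn’t specify one) a sample of 100 penalties: if the true rate really is 65%, how likely is a sample rate of 70% or higher?

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of seeing 70 or more conversions in 100 penalties
# when the true success rate is 65%.
tail = binom_tail(70, 100, 0.65)
```

The probability comes out well above any conventional cut-off, so a sample rate of 70% from 100 penalties would be entirely plausible under the expert’s claim.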

One way of addressing this issue is to construct a confidence interval for the true value based on the sample value. Without getting too hung up on technicalities, a confidence interval is a plausible range for the population value given the sample value. A 95% confidence interval is a range that will contain the true value with probability 95%; a 99% confidence interval is a range that will contain the true value with probability 99%; and so on.
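For a proportion like a penalty success rate, such an interval can be sketched with the standard normal-approximation (Wald) formula. The sample size of 100 below is an illustrative assumption, not a figure from the post:

```python
from math import sqrt

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.
    z = 1.96 gives a 95% interval, z = 2.576 a 99% interval."""
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the sample proportion
    return p_hat - z * se, p_hat + z * se

# 70% success rate observed in an assumed sample of 100 penalties
lower, upper = proportion_ci(0.70, 100)
```

With these numbers the interval runs from roughly 61% to 79%, comfortably containing 65% – so a sample of this size would give no reason to doubt the expert.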

So why not go with a 100% confidence interval? Well, in most circumstances this would be an interval that stretches to infinity in both directions, and we’d be saying that we can be 100% sure that the true value is between plus and minus infinity. Not very helpful. At the other extreme, a 1% confidence interval would be very narrow, but we’d have virtually no confidence that it contained the true value. So, it’s usual to adopt 95% or 99% confidence intervals as benchmarks, as they generally provide intervals that are both short enough and with high enough confidence to give useful information.

For problems as simple as the one above, calculating confidence intervals is straightforward. But, crucially, the size of the confidence interval depends on the size of the data sample. With small samples, there is more variation, and so the confidence intervals are wider; with large samples there is less variation, and the confidence intervals are narrower.
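The effect is easy to see numerically. A quick sketch comparing the 95% interval width at two sample sizes (both chosen arbitrarily for illustration):

```python
from math import sqrt

def ci_width(p_hat, n, z=1.96):
    """Width of a normal-approximation confidence interval for a proportion."""
    return 2 * z * sqrt(p_hat * (1 - p_hat) / n)

width_small = ci_width(0.70, 50)   # small sample: wide interval
width_large = ci_width(0.70, 500)  # ten times the data: much narrower
```

Multiplying the sample size by 10 shrinks the interval by a factor of √10, about 3.2 – the width goes down with the square root of the sample size, not the sample size itself.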

The following graph illustrates this for the example above.

• The horizontal axis gives different values for the size of the sample on which the value of 70% was based.
• The horizontal red line shows the hypothesised population value of 65%.
• The horizontal green line shows the observed sample value of 70%.
• For each choice of sample size, the vertical blue line shows a 95% confidence interval for the true population value based on the sample value of 70%.

What emerges is that up until a sample size of 300 or so, the 95% confidence interval includes the hypothesised value of 65%. In this case, the observed data are consistent with the hypothesis, which we therefore have no reason to doubt. For larger sample sizes, the hypothesised value falls outside of the interval, and we would be led to doubt the claim of a 65% success rate. In other words: (sample) size does matter. It determines how much variation we can anticipate in estimates, which in turn determines the size of confidence intervals and by extension the degree to which the sample data can be said to support or contradict the hypothesised value.

The story is much the same with 99% confidence intervals, as shown in the following figure.

The intervals are wider, but the overall pattern is much the same. However, with this choice the data contradict the hypothesis only for sample sizes of around 500 or more.
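The crossover points quoted above can be checked directly. A sketch, again using the normal-approximation interval and assuming the sample proportion stays fixed at 70% as the sample size grows:

```python
from math import sqrt

def smallest_contradicting_n(p_hat, p_hypo, z):
    """Smallest sample size at which p_hypo falls below the interval
    p_hat +/- z * se, i.e. the data start to contradict the hypothesis."""
    n = 1
    while p_hat - z * sqrt(p_hat * (1 - p_hat) / n) <= p_hypo:
        n += 1
    return n

n95 = smallest_contradicting_n(0.70, 0.65, z=1.96)   # 95% intervals
n99 = smallest_contradicting_n(0.70, 0.65, z=2.576)  # 99% intervals
```

This gives thresholds of a little over 300 for the 95% intervals and a little over 550 for the 99% intervals, consistent with the “300 or so” and “around 500 or more” read off the figures.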

Whether you choose to base your decisions on confidence intervals at the 95%, 99% or some other level is a matter of preference. Bear in mind, though, that there are two types of error we can make: we might reject the hypothesised value when it’s true, or accept it when it’s false. Using 99% intervals rather than 95% will reduce the chance of making the first error, but increase the chance of the second. We can’t have it both ways. The only way of reducing the chances of both types of error is to increase the sample size. Again: size matters.
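The first type of error – rejecting a hypothesis that is actually true – can be illustrated by simulation. A sketch, assuming a true rate of 65% and repeatedly drawing samples of 1,000 penalties (both values chosen for illustration):

```python
import random
from math import sqrt

random.seed(1)

def ci_excludes(p_true, n, z):
    """Draw one sample of size n with success probability p_true and report
    whether the resulting interval excludes p_true (a false rejection)."""
    p_hat = sum(random.random() < p_true for _ in range(n)) / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return abs(p_hat - p_true) > z * se

sims = 2000
false_alarms_95 = sum(ci_excludes(0.65, 1000, 1.96) for _ in range(sims)) / sims
false_alarms_99 = sum(ci_excludes(0.65, 1000, 2.576) for _ in range(sims)) / sims
```

The 95% intervals wrongly reject the true value roughly 5% of the time, the 99% intervals only around 1% – the price being that the wider 99% intervals are slower to detect a hypothesis that is genuinely false.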

Footnote: one thing worth noting in the figures above is that the confidence intervals change in length quite quickly when the sample size is small, but more slowly when the sample size is large. This can be stated more precisely: to make a confidence interval half as wide, you have to multiply the sample size by 4. So if your sample size is 10, you just need to increase it to 40 to make confidence intervals half as wide; but if the sample size is 100, you have to increase it to 400. It’s a sort of law of diminishing returns, which has important consequences if data are expensive. An initial investment of, say, £100’s worth of data will give you an answer with a certain amount of accuracy, but each further investment of £100 will improve accuracy by a smaller and smaller amount. At what point is the cost of potential further investment too great for the benefits it would lead to?
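The rule of thumb in the footnote follows from the interval formula: the width is proportional to 1/√n, so quadrupling the sample size halves it, whatever the starting size. A quick check:

```python
from math import sqrt

def interval_width(p_hat, n, z=1.96):
    """Width of a normal-approximation confidence interval for a proportion,
    proportional to 1/sqrt(n)."""
    return 2 * z * sqrt(p_hat * (1 - p_hat) / n)

# Quadrupling n halves the width, regardless of the starting sample size
ratio_small = interval_width(0.70, 40) / interval_width(0.70, 10)    # n: 10 -> 40
ratio_large = interval_width(0.70, 400) / interval_width(0.70, 100)  # n: 100 -> 400
```

Both ratios come out at exactly one half, which is the diminishing-returns pattern in the footnote: each extra £100 of data buys less accuracy than the last.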