Random sampling


A recurrent theme in the COVID-19 posts to this blog is the difficulty in interpreting data and analyses due to the way data are collected. Different countries have completely different protocols for testing for the disease, and these protocols have also changed through time. Even the reported death counts are unreliable as a measure of the disease effects.

In one earlier post I mentioned two case studies where entire – though limited – populations had been tested: Vò, in northern Italy and the Diamond Princess cruise ship. Since these entire populations were studied, the data are 100% complete, but they are special cases, since they are closed populations where an outbreak of the epidemic is known to have occurred. But what about entire countries? It’s obviously impractical – at least in present circumstances – to test an entire population.

A statistically valid alternative in this case is random sampling – testing individuals randomly selected from the entire population. The proportion testing positive in the sample provides an estimate of the proportion in the entire population, and the bigger the sample, the better the estimate. Obviously there are logistical difficulties in testing genuinely randomly selected individuals, so various practical modifications are often implemented which have to be corrected for in the analysis. But the principle is the same: to use information from a randomly selected sample of individuals to estimate the population level.

A study of this type has now been carried out for Austria. Full details of the analysis can be found here. In summary:

  • 1,544 individuals were included in the study which was carried out in the first week of April;
  • These individuals were identified by a stratification procedure: 249 Austrian districts were randomly selected; households in those districts were randomly selected; individuals within those households were randomly selected;
  • Such individuals were invited to participate in the study – the acceptance rate was 77%;
  • Hospitalised individuals were excluded from the study;
  • Final results were adjusted to correct for various factors including household size, gender and age.

The conclusion, after the correction for age, gender and other effects, is that the COVID-19 infection rate in the sample was 0.33%.

Now, bearing in mind that one solution to the epidemic is that a large proportion of the population acquires the disease, so building a ‘herd immunity’, the figure of 0.33% is disappointingly small. Even allowing for sampling error, the true value in the population is predicted in the study to be at most 0.76%, whereas it’s thought that herd immunity will require around 60-80% of the population to have been infected.
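To give a feel for how sampling error turns a single estimate into a range, here is a minimal sketch of a standard interval for a proportion (the Wilson score interval). It is not the method used in the Austrian study, which also corrected for household size, gender and age. The count of 5 positives is my own assumption, chosen because 0.33% of 1,544 is roughly 5.

```python
import math

def wilson_interval(positives, n, z=1.96):
    """Wilson score interval for a proportion (roughly 95% when z = 1.96)."""
    p_hat = positives / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Assumption: about 5 positives among the 1,544 people tested.
low, high = wilson_interval(5, 1544)
print(f"estimate {5 / 1544:.2%}, interval {low:.2%} to {high:.2%}")
```

Even this unweighted calculation puts the upper limit well below 1%, which is the important point when comparing against the 60-80% thought to be needed for herd immunity.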

However, this figure of 0.33% is just a snapshot in time of people who currently have the virus; it doesn’t say anything about the proportion of people who have had the virus – perhaps asymptomatically – and recovered. That figure, which is the figure of interest when discussing herd immunity, is bound to be bigger. But it’s impossible from this study to say by how much.

Moreover, extrapolating the 0.33% to the entire population of Austria would imply around 28,500 positive cases. By contrast, the number of active cases in Austria (as of today, 11 April) is recorded as 6,608:


So, even as low an estimate as 0.33% for the countrywide infection rate implies a roughly four-fold increase in the number of cases above and beyond the official numbers.
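The arithmetic behind that comparison is simple enough to check directly. The two figures are the ones quoted above; nothing else is assumed.

```python
estimated_infected = 28_500  # extrapolation of the 0.33% sample rate to all of Austria
recorded_active = 6_608      # officially recorded active cases on 11 April

ratio = estimated_infected / recorded_active
print(f"estimated cases are about {ratio:.1f} times the recorded figure")
```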


Why what we think we know is wrong

A recent Guardian article explained why many of the data that are reported about the spread of Coronavirus are bound to be wrong.

As discussed in previous posts – here, for example – a difficulty when studying data from the Coronavirus pandemic is the reliability and completeness of the data. This is especially true when looking at the number of confirmed cases, since this measure depends very much on the protocol used for testing. However, it’s also true for the number of fatalities. The following tweet leads to a thread by James Tozer of the Economist who has collated evidence from journalists across Europe that suggests the number of deaths due to COVID-19, at least indirectly, might be around double the officially reported number.

The thread is also available in complete form here.

The analysis is pretty much summarised in the following graphic:

The diagrams take different forms, but in each case there’s a black or grey level that corresponds to the number of deaths that would be expected in that period based on previous years’ data. The red level then shows the number of reported deaths due to COVID-19 in that period this year. And the pink region shows the total number of deaths due to any cause, again in the same period this year.

And it’s similar in the United States. The following graph – provided on Twitter by @Tangotiger – shows the excess number of deaths per month in New York, compared to the long-term average, over a period from 2000 onwards.

There is a large spike in September 2001 caused by the 9/11 tragedy, but there is a much larger spike for March/April 2020. Only around 60% of that excess, however, is due to officially recorded COVID-19 related causes. So what explains the other 40%? It’s too large to be explained by random variation – compare its size to the variations that you see in other months over the same period – so it must be due to some specific effect in 2020, for which the only plausible explanation is COVID-19. That’s not to say that all of these deaths are directly attributable to the Coronavirus, though almost certainly many are from people who were positive but not tested. Others, though, are likely to be due to people dying from illnesses that in normal circumstances would have been treatable with medical support.
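The bookkeeping behind an excess-deaths comparison can be sketched in a few lines. The function name and the monthly figures below are hypothetical, picked only so that the split matches the 60% and 40% shares described above; they are not the New York numbers.

```python
def excess_breakdown(expected, total_observed, official_covid):
    """Split excess deaths into officially attributed and unexplained parts."""
    excess = total_observed - expected
    unexplained = excess - official_covid
    return {
        "excess": excess,
        "official_share": official_covid / excess,
        "unexplained_share": unexplained / excess,
    }

# Hypothetical monthly figures, for illustration only.
result = excess_breakdown(expected=5_000, total_observed=15_000, official_covid=6_000)
print(result)
```

The expected level plays the role of the black or grey baseline in the earlier diagrams, and the unexplained share is the part of the excess that the official COVID-19 count does not cover.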

So, as with all data that are generated by this pandemic, what we think we know about fatality counts is almost certainly wrong. A reasonable rule of thumb is to take the officially published numbers and double them.


Seasonal effects

There’s been plenty of speculation (here, here, here,…) that the novel Coronavirus might be seasonal, meaning that transmission rates will reduce significantly in the warmer summer months in temperate countries. This would help significantly in controlling the current epidemic wave, potentially buying considerable time in allowing vaccine development or other exit strategies from current lockdown conditions.  But so far there’s been little direct evidence that the Coronavirus is genuinely seasonal.

However, the following tweet links to a statistical analysis which, though circumstantial, provides reason to believe in a seasonal effect. The author of the study looked at per-capita death rates due to COVID-19 in individual counties of the United States. They then fitted a regression model using demographic and climate-based statistics as potential explanatory variables for differences in county-to-county rates. What emerged is that temperature is the most significant factor. That’s to say, after allowance for other explanatory factors, the one that had the most impact was temperature: in counties with higher average temperature, everything else being equal, the per-capita death rate due to COVID-19 was lower.
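To show what ‘everything else being equal’ means in a regression, here is a small sketch using ordinary least squares on made-up data. The county values, the choice of median age as the second variable, and the coefficients are all invented for illustration; this is not the model or the data from the study.

```python
def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic counties: (average temperature, median age). Invented values.
counties = [(5, 38), (10, 42), (15, 35), (20, 40), (25, 37), (30, 44)]
# Synthetic death rate built so that temperature has a known effect of -0.3.
rates = [10 - 0.3 * temp + 0.2 * age for temp, age in counties]

intercept, temp_effect, age_effect = fit_ols(counties, rates)
print(f"temperature coefficient: {temp_effect:.2f}")
```

The temperature coefficient is the change in death rate for a one-degree rise in temperature with age held fixed, which is the sense in which a regression separates one explanatory factor from the others.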

Of course, there are all sorts of caveats – see discussion here – about extrapolating from conclusions of the type here to assuming seasonality in the worldwide transmission behaviour of the virus. But it is, at least, another reason to be optimistic.

Risky talk

In a previous post I referenced a book by the eminent statistician David Spiegelhalter. Since earlier this year, David has also been producing a podcast ‘Risky Talk’ on the relevance of Statistics for various issues of public interest. The latest of these is titled ‘Coronavirus: Understanding the Numbers’ and is full of useful information and discussion. It includes, among other things, a discussion of:

  1. Which data are most reliable for understanding the epidemic;
  2. How the different approaches to the epidemic adopted in Norway and Sweden provide a live experiment for assessing the impact of social controls;
  3. A comparison of the seriousness of COVID-19 relative to other flu-like illnesses in the UK;
  4. A discussion of the personal risk we all carry of dying from COVID-19 and other causes.

It’s a great listen and there’s probably nobody better qualified to be explaining these issues.


The graph above is the latest (as of 5th April) update from the FT showing a 7-day rolling average of the number of new COVID-19 confirmed cases through time for a number of countries. The point of using a 7-day rolling average – which means each value is the average of the previous 7-days’ values – is to reduce the effect of randomness in day-to-day variations, so as to get a smoother picture of trends. As discussed in previous posts, it’s possibly misleading to use confirmed cases as a strict measure of the epidemic scale, since the number of confirmed cases will depend in part on the protocol for testing, which varies from country to country, and even within each country through time. Nonetheless, it’s likely to be broadly interpretable as an indicator of epidemic strength.
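The rolling average itself is easy to compute. This is a minimal sketch with invented daily counts; the FT’s exact procedure may differ in detail.

```python
def rolling_mean(values, window=7):
    """Trailing mean: each output is the average of the latest `window` values."""
    out = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        out.append(sum(chunk) / window)
    return out

# Made-up daily case counts with a weekend reporting dip, for illustration only.
daily = [100, 120, 90, 150, 160, 80, 70, 130, 150, 120, 180, 190, 110, 100]
smoothed = rolling_mean(daily)
print(smoothed)
```

Because every 7-day window contains exactly one of each day of the week, regular weekday and weekend reporting effects are averaged out as well as purely random variation.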

Notwithstanding this issue, if the epidemic were growing exponentially in any country, the graph would show as a straight line on this logarithmic scale. To a greater or lesser extent, the curves for almost all countries show a tendency to flatten through time, especially from the time that social measures have been applied to limit potential transmissions through contact. The curve for the UK remains stubbornly close to linear, but its lockdown was introduced later – in relative terms – than for most other European countries. The curve for Italy seems to have flattened quicker than for other countries – again relative to when the country was placed on lockdown – but that’s probably because severe local restrictions were placed on the worst-affected regions some time before the entire country was placed on lockdown.

But anyway…. the point I wanted to make in this post is a little different. There are several reasons why it’s a good idea to use a logarithmic scale in graphs like the one above. Mostly this is because there are good epidemiological reasons to believe – as discussed here –  that an unchecked epidemic will grow exponentially. And exponential growth on a logarithmic scale will appear linear, which makes comparisons and contrasts much easier. But one disadvantage of the logarithmic scale in this context is that it can give a false impression as to the degree of similarity between countries. Looking at the above graph, it’s true that the trajectory for the United States looks currently worse than that for other countries, but not so much worse. But now look at the same graph, from a day or two earlier, on a linear, instead of logarithmic, scale:

On this scale the difference in trajectory for the United States relative to each of the other countries is much more apparent. The current level is very much greater, while the tendency for growth is also considerably more dramatic.

In summary, different scales for graphs are useful for different purposes. And though the logarithmic scale is better than a linear scale for most purposes in tracking an epidemic, it’s only once you put things back on a linear scale that you get a true sense of how different the epidemic currently is on the ground in different countries.
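A quick numerical sketch makes the same point. The two series below are hypothetical countries growing at an identical 20% a day from different starting levels; none of the numbers come from the FT graphs.

```python
import math

def log_steps(series):
    """Day-to-day differences of log10 values; constant steps mean exponential growth."""
    logs = [math.log10(v) for v in series]
    return [b - a for a, b in zip(logs, logs[1:])]

# Hypothetical countries: same growth rate, tenfold difference in level.
small = [1_000 * 1.2**t for t in range(6)]
large = [10_000 * 1.2**t for t in range(6)]

print(log_steps(small))           # equal steps: a straight line on a log scale
print(log_steps(large))           # the same steps: a parallel straight line
print(large[0] - small[0])        # gap in actual cases on the first day
print(large[-1] - small[-1])      # gap in actual cases on the last day
```

On the logarithmic scale the two countries are parallel lines a fixed distance apart, while on the linear scale the gap in actual cases grows every day.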


A changing world

In an earlier post, I discussed the ‘stringency index’, which has recently been developed as a way of measuring how severe – stringent – a country’s response has been to the Coronavirus epidemic.

The Financial Times, as part of its live coronavirus coverage, has now produced the following animated world map of the stringency index from the start of the year up to 24 March:

It’s striking how most of the world outside of China stays blue for most of February – arguably time thrown away – and how rapidly most of the world turns red and purple from the middle of March.

As an aside, the tweet below contains a great video where John Burn-Murdoch of the FT explains several of the decisions made by his team in the way they have chosen to present graphs showing the scale of the epidemic across countries:

Sex and the Coronavirus

Actually, not in that sense, but you can find relevant information here.

For good and for bad, the Coronavirus epidemic is generating a large amount of data. And as more data become available, Statistics plays its part in understanding the virus in terms of its mechanisms of transmission and spread.

One very obvious aspect of the original Chinese data – described in an academic paper in the Lancet – which has subsequently been confirmed as data from other countries became available, is a difference in death rates for infected males and females. The rate of contagion for males and females is broadly similar, as shown in the following diagram

The slight difference in rate of infection between the sexes has also been subsequently observed in other countries – males always having a slightly higher infection rate – so although the difference is slight, it’s likely to be a genuine phenomenon rather than a random effect due to small amounts of data.

But in any case, this difference in infection rates pales in comparison with the difference in death rates for males and females. In the original Lancet paper the ratio of male to female Coronavirus deaths is reported as 73% : 27%. So if you’re a male, does this place you in a higher risk category?

Not necessarily. In pre-coronavirus days, various posts in this blog – for example here – discussed the way that an apparent effect, such as death rates varying according to an individual’s sex, could actually be due to an entirely different phenomenon. In particular, smoking rates among men in China are very much higher than those of women. And since almost all deaths due to Coronavirus occur via failures of the respiratory system, it was hypothesised that the increased death rate among men was actually a consequence of smokers being at higher risk.

Unfortunately for men – though not for smokers – this hypothesis has been found to be unsupported by data from other countries. Based on the latest available data from all countries, the death rates for males and females who contract COVID-19 are given by the following table:

The fatality rates are different depending on whether you look at confirmed or unconfirmed cases, but in each case the ratio of fatalities of males to females is around 62% : 38%. This is a less extreme ratio than was found from the Chinese data, but since this now includes data from countries where the difference in smoking rates between males and females is much smaller than for China, it implies that smoking is not the only issue. It might explain why the ratio is worse for China than elsewhere, but it can’t be the whole story.
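To translate these shares into a relative risk, a rough calculation helps. It assumes equal numbers of infected males and females, which, as noted above, is only approximately true since males have a slightly higher infection rate.

```python
def male_to_female_ratio(male_share, female_share):
    """Ratio of male to female deaths, given each sex's share of all deaths."""
    return male_share / female_share

# Shares of deaths quoted above; equal numbers infected is my simplifying assumption.
all_countries = male_to_female_ratio(0.62, 0.38)
china_lancet = male_to_female_ratio(0.73, 0.27)

print(f"all countries: about {all_countries:.1f} male deaths per female death")
print(f"original Chinese data: about {china_lancet:.1f} male deaths per female death")
```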

This New York Times article based on the Italian data points out that previous coronavirus epidemics such as SARS and MERS also led to higher fatality rates among males, and argues this is likely to be due to women having generally stronger immune defence systems due to genetics.


  1. Various newspaper articles have discussed this phenomenon: here, for a discussion of the Italian data; here for a discussion of the Spanish data.
  2. The Lancet paper referred to above was published on 29 January. It concluded

We have to be aware of the challenge and concerns brought by 2019-nCoV to our community. Every effort should be given to understand and control the disease, and the time to act is now.

Of course, it’s easy to be wise after the event. But the Lancet paper was wise before any of the events outside of China had taken place.

Update: There’s new evidence from data in the UK which adds weight to the difference in death rates for males and females due to COVID-19. A discussion of these data, repeating many of the points made in the post above, is available here.

Stay strong, stay at home

This is a quick follow-up to yesterday’s post, ‘Reasons to be cheerful’. I suggested there that looking at the same data in different ways can give you an alternative perspective on things. Specifically, I showed how looking at the rate of change in the number of new Coronavirus cases leads to a more optimistic view of how the epidemic is being brought under control, compared to just looking at the cases.

The following graph is like that of the previous post, but now showing the number of deaths due to Coronavirus through time in the worst affected countries.

Again, for most countries, there is some slight flattening of the curves, but if you live in a country like Italy, it’s difficult to see much encouragement that things are actually improving, despite the country now having been in total lockdown for 3 weeks.

But, in a series of very helpful tweets, Julia Steinberger, who is professor in social ecology and ecological economics at the University of Leeds, presented the data in a different way, shedding a different light on things. Her graph, shown below, plots the current doubling rate of new Coronavirus deaths against the total number of deaths. The doubling rate is the number of days it will take the number of deaths to double if the current rate of growth is maintained. So, the higher the value, the better the epidemic is being contained.

Looked at this way:

  • Improvements in Spain, Italy and especially China, where social restrictions have been in place longest, are evident.
  • In the US, where the potential scale of the epidemic was initially underestimated, the doubling rate decreased for some time, and has only recently started to climb.
  • In the UK, after initially climbing, the doubling rate has actually been declining, though the number of deaths in the last couple of days since the graph was produced has been lower, so the doubling rate has actually increased in recent days. Based on today’s numbers, the current doubling rate is around 4.9 days for the UK.

Updating to the most recent numbers for other countries as well, we find the current doubling rate for the US is around 4.6 days, for Spain it’s 5.4 days and for Italy it’s 9.53 days. In other words, it’s improving everywhere.
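For anyone wanting to reproduce this kind of figure, here is a minimal sketch of one common way to compute a doubling rate from two cumulative counts. It assumes steady exponential growth between the two dates; I don’t know the exact formula used for the graph, and the example counts are invented.

```python
import math

def doubling_time(count_start, count_end, days):
    """Days needed for the cumulative count to double at the growth seen over `days`."""
    growth = math.log(count_end / count_start) / days  # per-day log growth rate
    return math.log(2) / growth

# Invented example: cumulative deaths rising from 1,000 to 1,500 over 3 days.
print(f"{doubling_time(1_000, 1_500, 3):.1f} days")
```

A count that goes from 100 to 200 in 5 days gives exactly 5 days, as it should, and slower growth over the same period gives a larger value.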

Admittedly, the picture is a bit more noisy than that of the previous post,  partly because there are fewer deaths than cases, and also because these are daily values rather than weekly averages. But in any case the message is clear, especially once numbers are updated using the most recent data: social restrictions are working and numbers are improving, even if it’s difficult to see from the original plot. Re-interpreting the numbers in terms of doubling rates gives a much more optimistic picture of how the epidemic is being brought under control.

In summary: stay strong, stay at home. It does work.

It’s probably best to be a little cautious when interpreting the recent improvement in the UK numbers. Legally binding social restrictions have only been in place for a week, which is too short a time for effects to show up in the numbers of fatalities. So, whatever improvements there have been in numbers in the last couple of days are not due to government restrictions. It’s possible, however, that people’s behaviour patterns had changed in advance of the formal government restrictions being announced, and this is what’s driving the improvement in numbers. It’s also possible, however, that the improvement is due to a combination of noise and changes in the way the data are being collated. We’ll get a clearer picture in the next few days once more data become available.

Reasons to be cheerful…

Ok, not cheerful exactly, but optimistic.

Often, looking at the same data in a different way can give a completely different perspective on things. The following graph is the updated number of reported Coronavirus cases country-by-country through time.

A few comments:

  • The graph for each country is shifted so that time is measured from the first date on which 1000 cases were reported in that country. In this way the graph for each country is starting at roughly the same level.
  • The graph is on a logarithmic scale, meaning that exponential growth as discussed in earlier posts, would show up as a straight line on this graph.
  • Almost all countries display exponential growth at the start of the epidemic followed by a flattening. Both the rate of exponential growth and the tendency to flatten vary from country to country.
  • Despite the lockdowns and other restrictions imposed in many countries in recent weeks, it’s hard to convince yourself that there’s been any noticeable improvement.
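The time shift described in the first comment above can be sketched as follows. The function name and the case counts are my own, for illustration only.

```python
def align_from_threshold(cumulative_cases, threshold=1000):
    """Keep the series from the first day the count reaches `threshold` onwards."""
    for day, count in enumerate(cumulative_cases):
        if count >= threshold:
            return cumulative_cases[day:]
    return []  # threshold never reached

# Invented cumulative counts for a single country.
cases = [120, 340, 780, 1_150, 1_900, 3_100]
print(align_from_threshold(cases))
```

After the shift, day 0 for every country is the first day with at least 1000 reported cases, so curves can be compared by stage of the epidemic rather than by calendar date.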

And yet… based on the same data – albeit half a day later or so – the following graph shows the percentage increase in new cases – averaged over the previous week to minimise the effect of random day-to-day changes.

For almost all of the countries, the daily percentage increase in cases has fallen and is continuing to fall. In Italy, for example, the daily increase has gone down from around 19% to 8% in the space of a couple of weeks. The trend in the UK is also downwards, but by a smaller amount. However, enforced social controls have only been in place in the UK for less than a week.
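One way to calculate a weekly-averaged daily percentage increase is sketched below. I am assuming the percentages are taken from cumulative confirmed cases, which may not match exactly how the graph was produced, and the example series is invented.

```python
def average_daily_increase(cumulative_cases, window=7):
    """Mean of the most recent `window` day-on-day percentage increases."""
    pct = [100 * (b - a) / a for a, b in zip(cumulative_cases, cumulative_cases[1:])]
    recent = pct[-window:]
    return sum(recent) / len(recent)

# Invented series growing by a steady 8% a day.
steady = [1_000 * 1.08**t for t in range(8)]
print(f"{average_daily_increase(steady):.1f}% per day")
```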

One slight caveat is that the information from these graphs is limited to confirmed cases. This means that:

  1. The numbers themselves are bound to be an underestimate of the number of infected individuals in a country;
  2. Comparisons between countries are complicated by the fact that some countries are testing many more individuals than others;
  3. And the trajectory for each country is also complicated by possible changes in testing protocols as the epidemic has evolved.

Nonetheless, the overall trends in these graphs are likely to be broadly indicative of a slowing of the epidemic in almost all countries. The picture for the US is especially complicated however due to wide scale state-by-state differences in testing protocols, that are also changing rapidly in time.

Of course, what we’re seeing in terms of changes in growth rate is also present in the graph above on case numbers. The almost linear reduction in growth rates is due to the slight flattening of the curves of the case numbers on a logarithmic scale. It’s simply that looking at the data this way, the daily changes are highlighted and we get a more realistic – even optimistic – picture of how, despite daily numbers of cases that seem persistently high, the projection for a couple of weeks’ time is that the rate of new cases will be totally manageable.

So, be optimistic, cheerful even, that the social restrictions are having an effect on the epidemic growth and that there is light at the end of the tunnel.

And if you need them,  here are many more reasons to be cheerful, curated by David Byrne, no less.


The World Health Organisation officially declared the current Coronavirus outbreak a pandemic on 11 March. A pandemic is technically defined as:

… new disease for which people do not have immunity spreads around the world beyond expectations…

though this is largely subjective, which is why the declaration for the current outbreak was not made till 11 March. But even before that date, most countries realised the Coronavirus epidemic was already on their doorsteps and needed some kind of response.

But how rapid and how stringent have different countries been in their responses?

This is the subject of a new tracker which monitors how different governments have responded to the crisis according to the number of cases they presently have in their country. Specifically, they define something called a stringency index which records, on a scale of 0 to 100, how stringent a country’s measures are. Full details of the definition of the stringency index and the methodology used are available here. Broadly speaking, the more restrictive and widespread a country’s measures, the greater the value of the index. However, the index does not measure how effective the measures are, nor how strictly they are applied or followed.

The tracker is live, which means it is regularly updated. However, as of 24 March, a summary of the way 6 different countries have responded to the crisis is contained in the following figure:

For each country, time is measured in days since the first case appeared in that country, and the black curve shows the trajectory of the epidemic in terms of number of cases. (Bear in mind though that the number of cases is also related to the number of tests carried out, so direct comparison of these curves across countries may not be entirely valid).

The red dots show the value of the stringency index on the same timescale. You need to look at the right-hand axis to read off the actual values of the index. For all countries the stringency index has generally risen as the epidemic has grown: countries have responded to the crisis by bringing in measures to control the virus spread. But there are significant differences across the different countries:

  • In France and especially Italy, the stringency index follows the trajectory of the epidemic very closely. In other words, governments there have responded quickly to the scale of the epidemic as it has grown.
  • In South Korea, where the epidemic has been largely controlled, the stringency measure values increase ahead of the growth of the epidemic. That’s to say, the government has anticipated the growth of the epidemic and brought disease control measures in quickly to stop the epidemic growth before it occurred.
  • The United Kingdom’s first use of restrictive measures was very slow, and they have since been playing catch-up relative to the size of the epidemic.
  • In the US, there was almost no attempt at control until long after the start of the epidemic. Belatedly, more stringent measures have been applied, but these are still substantially less restrictive than those of France or Italy.
  • China’s pattern is more complicated. Since they were the first country affected by the outbreak, it’s perhaps understandable that their initial response was slow. Their subsequent response was rapid, though, enabling a subsequent reduction in stringency, which has more recently been raised again – presumably in an attempt to prevent a second wave of the epidemic. The maximum stringency index is considerably lower than that of France or Italy, presumably because although their measures were more restrictive, they were localised in severity to the hardest-hit province of Hubei.

One might quibble about the actual definitions used for the stringency index, but these conclusions broadly chime with common perceptions about the efficacy of different government responses to the epidemic.