One of the most remarkable advances in scientific understanding over the last couple of hundred years has been Mendelian genetics. This theory explains the basics of genetic inheritance, and is named after its discoverer, Gregor Mendel, who developed the model based on observations of the characteristics of peas when cross-bred from different varieties. In his most celebrated experiment, he crossed pure yellow with pure green peas, and obtained a generation consisting of only yellow peas. But in the subsequent generation, when these peas were crossed, he obtained a mixed generation of yellow and green peas. Mendel constructed the theory of genes and alleles to explain this phenomenon, which subsequently became the basis of modern genetic science.

You probably know all this anyway, but if you’re interested and need a quick reminder, here’s a short video giving an outline of the theory.

Mendel’s pea experiment was very simple, but from the model he developed he was able to calculate the proportion of peas of different varieties to be expected in subsequent generations. For example, in the situation described above, the theory suggests that there would be no green peas in the first generation, but around 1/4 of the peas in the second generation would be expected to be green.
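To see where the "none, then 1/4" prediction comes from, here is a minimal Python sketch of the cross described above. The allele labels `Y` and `y`, the genotype strings and the function name `cross` are my own illustrative choices. The sketch assumes that each parent passes on one of its two alleles with equal probability and that yellow is dominant, which is what the all-yellow first generation implies.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent_a, parent_b):
    """Return the colour proportions among offspring of two genotypes.

    Genotypes are two-letter strings such as "Yy" (illustrative notation).
    Assumption: "Y" (yellow) is dominant over "y" (green), and each parent
    passes on either of its alleles with equal probability.
    """
    counts = Counter()
    # Enumerate the four equally likely allele pairings (a Punnett square).
    for allele_a, allele_b in product(parent_a, parent_b):
        colour = "yellow" if "Y" in (allele_a, allele_b) else "green"
        counts[colour] += 1
    total = sum(counts.values())
    return {colour: Fraction(n, total) for colour, n in counts.items()}


first_generation = cross("YY", "yy")   # pure yellow crossed with pure green
second_generation = cross("Yy", "Yy")  # first-generation peas crossed together

print(first_generation)
print(second_generation)
```

Under these assumptions the first generation contains no green peas at all, while a quarter of the second generation is green, matching the proportions quoted above.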

Mendel’s theory extends to more complex situations; in particular, it allows for the inheritance of multiple characteristics. In the video, for example, the colour of the peas (yellow or green) is considered alongside their shape (round or wrinkled). Mendel’s model leads to predictions of the proportion of peas in each generation when stratified by both characteristics: round and green, yellow and wrinkled, and so on.
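The two-characteristic case can be sketched in the same way. The code below is an illustration under stated assumptions rather than anything taken from Mendel: it assumes that colour and shape are inherited independently, that yellow is dominant over green, and that round is dominant over wrinkled. The genotype notation (`YyRr`) and the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def gametes(genotype):
    """List the allele combinations a parent can pass on.

    A genotype such as "YyRr" holds two colour alleles then two shape alleles.
    Assumption: the two characteristics are inherited independently.
    """
    colour_alleles, shape_alleles = genotype[:2], genotype[2:]
    return [c + s for c, s in product(colour_alleles, shape_alleles)]


def phenotype(gamete_a, gamete_b):
    """Describe the pea produced by combining one gamete from each parent."""
    # Assumption: "Y" (yellow) and "R" (round) are the dominant alleles.
    colour = "yellow" if "Y" in (gamete_a[0], gamete_b[0]) else "green"
    shape = "round" if "R" in (gamete_a[1], gamete_b[1]) else "wrinkled"
    return f"{shape} and {colour}"


def dihybrid_cross(parent_a, parent_b):
    """Return the proportion of each combined phenotype among the offspring."""
    pairings = product(gametes(parent_a), gametes(parent_b))
    counts = Counter(phenotype(a, b) for a, b in pairings)
    total = sum(counts.values())
    return {kind: Fraction(n, total) for kind, n in counts.items()}


proportions = dihybrid_cross("YyRr", "YyRr")
for kind, share in sorted(proportions.items()):
    print(kind, share)
```

With two hybrid parents, the sixteen equally likely pairings split into the four combined types in the ratio 9 : 3 : 3 : 1.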

The interesting thing from a statistical point of view is the way Mendel verified his theory. All scientific theories go through the same validation process: first there are some observations; second those observations lead to a theory; and third there is a detailed scrutiny of further observations to ensure that they are consistent with the theory. If they are, then the theory stands, at least until there are subsequent observations which violate the theory, or a better theory is developed to replace the original.

Now, where there is randomness in the observations, the procedure of ensuring that the observations are in agreement with the theory is more complicated. For example, consider the second generation of peas in the experiment above. The theory suggests that, on average, 1/4 of the peas should be green. So if we take 100 peas from the second generation, we’d expect around 25 of them to be green. But that’s different from saying exactly 25 should be green. Is it consistent with the theory if we get 30 green peas? Or 40? At what point do we decide that the experimental evidence is inconsistent with the theory? This is the substance of Statistics.

Actually, the theory of Mendelian inheritance can be expressed entirely in terms of statistical models. There is a specific probability that certain characteristics are passed on from parents to offspring, and this leads to expected proportions of different types in subsequent generations. And expressed this way, we don’t just learn that 1/4 of second generation peas should be green, but also the probability that in a sample of 100 we get 30, 40 or any number of green peas.
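As a rough illustration, suppose each of the 100 peas is independently green with probability 1/4, so that the number of green peas follows a binomial distribution. That independence is my simplifying assumption, and the function names below are purely illustrative.

```python
from math import comb


def binomial_pmf(k, n=100, p=0.25):
    """Probability of exactly k green peas among n, each green with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n=100, p=0.25):
    """Probability of k or more green peas among n."""
    return sum(binomial_pmf(j, n, p) for j in range(k, n + 1))


for greens in (25, 30, 40):
    print(
        f"{greens} green: P(exactly) = {binomial_pmf(greens):.4f}, "
        f"P(at least) = {prob_at_least(greens):.4f}"
    )
```

The "at least" column is the sort of quantity a statistician would look at when deciding whether 30 or 40 green peas is still consistent with the theory.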

And this leads to something extremely interesting: Mendel’s experimental results are simply too good to be true. For example (though I’m actually making the numbers up here), in repeats of the simple pea experiment he almost always got something very close to 25 green peas out of 100. As explained above, the statistics behind Mendelian inheritance do indeed say that he should have got an average of 25 green peas per 100. But the same theory also implies that 20 or 35 green peas out of 100 are entirely plausible, and indeed a spread of experimental results between 20 and 35 is to be expected. Yet each of Mendel’s experiments gave a number very close to 25. Ironically, if these really were the experimental results, they would be in violation of the theory, which predicts not just an average of 25, but also an appropriate amount of variation around that figure.
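A small simulation makes the point about variation concrete. This is only a sketch with numbers as made up as those above: it repeats a hypothetical 100-pea experiment many times, assuming each pea is independently green with probability 1/4, and looks at how much the counts spread around 25. The seed and the number of repeats are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def simulate_green_counts(n_experiments=2000, n_peas=100, p_green=0.25, seed=1):
    """Simulate the number of green peas in each of many repeated experiments."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    return [
        sum(rng.random() < p_green for _ in range(n_peas))
        for _ in range(n_experiments)
    ]


counts = simulate_green_counts()

# Spread predicted by the binomial model: sqrt(n * p * (1 - p)), about 4.33.
theoretical_sd = sqrt(100 * 0.25 * 0.75)

# Share of simulated experiments that land within one pea of 25.
within_one = sum(24 <= c <= 26 for c in counts) / len(counts)

print(f"average count: {mean(counts):.2f}")
print(f"observed spread (sd): {pstdev(counts):.2f}")
print(f"theoretical spread (sd): {theoretical_sd:.2f}")
print(f"share of experiments with 24 to 26 green peas: {within_one:.2f}")
```

If nearly every real experiment fell in that narrow 24 to 26 band, the results would match the average but not the spread that the model itself demands.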

So, Mendel’s experimental results were actually a primitive example of fake news. But here’s the thing: Mendel’s theory has subsequently been shown to be correct, even if it seems likely that the evidence he presented had been manipulated to strengthen its case. In modern parlance, Mendel focused on making sure his results supported the predicted average, but failed to appreciate that the theory also implied something about the variation in observations. So even if the experimental results were fake news, the theory itself has been shown to be anything but fake.

To be honest, there is some academic debate about whether Mendel cheated or not. As far as I can tell though, this is largely based on the assumption that since he was also a monk and a highly regarded scientist, cheating would have been out of character. Nobody seriously denies that the statistics are simply too good to be true. Of course, in the end, it really is all academic, as the theory has been proven to be correct and is the basis for modern genetic theory. If interested, you can follow the story a little further here.

Incidentally, the fact that statistical models speak about variation as well as about averages is essential to the way they get used in sports modelling. In football, for example, models are generally estimated on the basis of the average number of goals a team is expected to score. But the prediction of match scores as a potential betting aid requires information about the variation in the number of goals around the average value. And though Mendel seems not to have appreciated the point, a statistical model contains information on both averages and variation, and if a model is to be suitable for data, the data will need to be consistent with the model in terms of both aspects.
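To illustrate how an average turns into match-score probabilities, here is a minimal sketch. It assumes that each team’s goals follow a Poisson distribution and that the two teams score independently; both are my simplifying assumptions, not a description of any particular betting model. The average goal figures of 1.5 and 1.1 are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k, mean_goals):
    """Probability of exactly k goals when the average is mean_goals."""
    return exp(-mean_goals) * mean_goals**k / factorial(k)


def score_probabilities(home_mean, away_mean, max_goals=10):
    """Probability of each scoreline up to max_goals per team.

    Assumption: home and away goals are independent Poisson counts.
    """
    return {
        (home, away): poisson_pmf(home, home_mean) * poisson_pmf(away, away_mean)
        for home in range(max_goals + 1)
        for away in range(max_goals + 1)
    }


# Hypothetical averages, chosen purely for illustration.
scores = score_probabilities(home_mean=1.5, away_mean=1.1)

home_win = sum(p for (home, away), p in scores.items() if home > away)
draw = sum(p for (home, away), p in scores.items() if home == away)
away_win = sum(p for (home, away), p in scores.items() if home < away)

print(f"P(1-0) = {scores[(1, 0)]:.3f}")
print(f"home win {home_win:.3f}, draw {draw:.3f}, away win {away_win:.3f}")
```

The two averages alone say nothing about how likely a 1-0 or a 3-2 is; it is the assumed distribution around those averages that supplies the scoreline probabilities.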
