# Fringe benefits

The Edinburgh Fringe Festival is the largest arts festival in the world. The 2019 version has just finished, but Wikipedia lists some of the statistics for the 2018 edition:

1. the festival lasted 25 days;
2. it included more than 55,000 performances;
3. that comprised 3548 different shows.

The shows themselves are of many different types, including theatre, dance, circus and music. But the largest section of the festival is comedy, and performers compete for the Edinburgh Comedy Awards – formerly known as the Perrier Award – which is given to the best comedy show on the fringe.

I mention all this because the TV channel Dave also publishes what it regards as the best 10 jokes of the festival. And number 4 this year was a statistical joke.

Enjoy:

A cowboy asked me if I could help him round up 18 cows. I said, “Yes, of course. That’s 20 cows.”

Confession: the joke is really based on arithmetic rather than Statistics.

You’re at a party and meet someone. After chatting for a bit, you work out that the girl you’re talking to has 2 kids and one of them is a boy. What are the chances she’s got 2 boys rather than a boy and a girl?

Actually, I really want to ask a slightly more complicated question than this. But let’s take things slowly. Please think about this problem and, if you have time, mail me or send me your answer via this form. Subsequently, I’ll discuss the answer to this problem and ask you the slightly more complicated question that I’m really interested in.

# Terrible maps

One of the themes in this blog has been the creative use of diagrams to represent statistical data. When the data are collected geographically this amounts to using maps to represent data – perhaps using colours or shadings to show how a variable changes over a region, country or even the whole world.

With this in mind I recommend to you @TerribleMaps on Twitter.

It’s usually entertaining, and sometimes – though not always – scientific. Here are a few recent examples:

1. Those of you with kids are probably lamenting right now the length of the summer holidays. But just look how much worse it could be if, for example, you were living in Italy (!):
2. Just for fun… a map of the United States showing the most commonly used word in each state:
3. A longitudinal slicing of the world by population size. It’s interesting because the population per slice will depend both on the number of countries that are included and on the population density in those slices.
4. For each country in the following map, the flag shown is that of the country with which it shares the longest border. For example, the UK has its longest border with Ireland, and so is represented by the Ireland flag. Similarly, France’s flag is that of Brazil!
5. This one probably only makes sense if you were born in, or have spent time living in, Italy:
6. While this one will help you get clued-up on many important aspects of UK culture:
7. And finally, this one will help you understand how ‘per capita’ calculations are made. You might notice there’s one country with an N/A entry. Try to identify which country that is and explain why its value is missing.

In summary, as you’ll see from these examples, the maps are usually fun, sometimes genuinely terrible, but sometimes contain a genuine pearl of statistical or geographical wisdom. If you have to follow someone on Twitter, there are worse choices you could make.

# Zipf it

In a recent post I explained that in a large database containing the words from many English language texts of various types, the word ‘football’ occurred 25,271 times, making it the 1543rd most common word in the database. I also said that the word ‘baseball’ occurred 28,851 times, and asked you to guess what its rank would be.

With just this information available, it’s impossible to say with certainty what the exact rank will be. We know that ‘baseball’ is more frequent than ‘football’ and so it must have a higher rank (which means a rank with a lower number). But that simply means it could be anywhere from 1 to 1542.

However, we’d probably guess that ‘baseball’ is not so much more popular a word than ‘football’; certainly other words like ‘you’, ‘me’, ‘please’ and so on are likely to occur much more frequently. So, we might reasonably guess that the rank of ‘baseball’ is closer to the lower limit of 1542 than it is to the upper limit of 1. But where exactly should we place it?

Zipf’s law provides a possible answer.

In its simplest form Zipf’s law states that for many types of naturally occurring data – including frequencies of word counts – the second most common word occurs half as often as the most common; the third most common occurs a third as often as the most popular; the fourth most common occurs a quarter as often; and so on. If we denote by f(r) the frequency of the item with rank r, this means that

$f(r) = C/r$

or

$r\times f(r)=C$,

where C is the constant f(1). And since this is true for every choice of r, the frequencies and ranks of the words ranked r and s are related by

$r\times f(r)=s \times f(s)$.

Then, assuming Zipf’s law applies,

$\mbox{rank}(\mbox{`baseball'}) = \mbox{rank}(\mbox{`football'}) \times f(\mbox{`football'})/f(\mbox{`baseball'})$

$= 1543 \times 25271/28851 \approx 1352$
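If you’d like to check the arithmetic, here’s a minimal Python sketch of the same calculation. The function name `zipf_rank_estimate` is just something I’ve made up for illustration, and the calculation assumes the simple form of Zipf’s law, in which rank times frequency is constant.

```python
def zipf_rank_estimate(known_rank, known_freq, target_freq):
    """Estimate a rank under simple Zipf's law, where rank x frequency is constant."""
    return known_rank * known_freq / target_freq

# 'football': rank 1543, frequency 25,271; 'baseball': frequency 28,851
estimate = zipf_rank_estimate(1543, 25271, 28851)
print(round(estimate))  # 1352
```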

So, how accurate is this estimate? The database I extracted the data from is the well-known Brown University Standard Corpus of Present-Day American English. The most common 5000 words in the database, together with their frequencies, can be found here. Searching down the list, you’ll find that the rank of ‘baseball’ is 1380, so the estimated value of 1352 is not that far out.

But where does Zipf’s law come from? It’s named after the linguist George Kingsley Zipf (1902-1950), who observed the law to hold empirically for words in different languages. Rather like Benford’s law, which we discussed in an earlier post, different arguments can be constructed that suggest Zipf’s law might be appropriate in certain contexts, but none is overwhelmingly convincing, and it’s really the body of empirical evidence that provides its strongest support.

Actually, Zipf’s law

$f(r) = C/r,$

is equivalent to saying that the frequency distribution follows a power law where the power is equal to -1. But many fits of the model to data can be improved by generalising this model to

$f(r)=C/r^k$

for some constant k. In this more general form the law has been shown to work well in many different contexts, including sizes of cities, website access counts, gene expression frequencies and strength of volcanic eruptions. The version with k=1 is found to work well for many datasets based on frequencies of word counts, but other datasets often require different values of k. But to use this more general version of the law we’d have to know the value of k, which we could estimate if we had sufficient amounts of data. The simpler Zipf’s law has k=1 implicitly, and so we were able to estimate the rank of ‘baseball’ with just the limited amount of information provided.
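To give a flavour of how k might be estimated, here’s a rough sketch. Taking logs of the general law gives log f(r) = log C - k log r, so a straight-line fit of log frequency against log rank has slope -k. This least-squares approach is my own choice for illustration, not necessarily the best estimator, and the data below are made up so that they follow the law exactly with C = 1000 and k = 1.2.

```python
import math

def fit_power_law(ranks, freqs):
    """Least-squares fit of log f = log C - k log r; returns (C, k)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope

# Made-up frequencies (not real word counts) following f(r) = 1000 / r^1.2
ranks = list(range(1, 51))
freqs = [1000 / r ** 1.2 for r in ranks]
C_hat, k_hat = fit_power_law(ranks, freqs)
print(round(C_hat, 3), round(k_hat, 3))
```

With real data the points won’t lie exactly on a line, and the fitted k will only be an estimate.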

Finally, I had just 3 responses to the request for predictions of the rank of ‘baseball’: 1200, 1300 and 1450, each of which is entirely plausible. But if I regard each of these estimates as those of an expert and try combining those expert opinions by taking the average I get 1317, which is very close to the Zipf’s law prediction of 1352. Maybe if I’d had more replies the average would have been even closer to the Zipf’s law estimate or indeed to the true answer itself 😏.

# Word rank

I recently came across a large database of the use of American English words. It aims to provide a representative sample of the usage of American English by including the words extracted from a large number of English texts of different types – books, newspaper articles, magazines etc. In total it includes around 560 million words collected over the years 1990-2017.

The word ‘football’ occurs in the database 25,271 times and has rank 1543. In principle, this means that ‘football’ was the 1543rd most frequent word in the database, though the method used for ranking the database elements is a little more complicated than that, since it attempts to combine a measure of both the number of times the word appears and the number of texts it appears in. Let’s leave that subtlety aside though and assume that ‘football’, with a frequency of 25,271, is the 1543rd most common word in the database.

The word ‘baseball’ occurs in the same database 28,851 times. With just this information, what would you predict the rank of the word ‘baseball’ to be? For example, if you think ‘baseball’ is the most common word, it would have rank 1. (It isn’t: ‘the’ is the most common word). If you think ‘baseball’ would be the 1000th most common word, your answer would be 1000.

Give it a little thought, but don’t waste time on it. I really just want to use the problem as an introduction to an issue that I’ll discuss in a future post. I’d be happy to receive your answer though, together with an explanation if you like, by mail. Or if you’d just like to fire an answer anonymously at me, without explanation, you can do so using this survey form.

# 1 in 562

Ever heard of the Fermi paradox? This phenomenon is named after the Italian physicist Enrico Fermi, and concerns the fact that though we’ve found no empirical evidence of extraterrestrial life, standard calculations based on our learned knowledge of the universe suggest that the probability of life elsewhere in our galaxy is very high. The theoretical side of the paradox is usually based on some variation of the Drake equation, which takes various known or estimated constants – like the number of observed stars in our galaxy, the estimated average number of planets per star, the proportion of these that are likely to be able to support life, and so on – and feeds them into an equation which calculates the expected number of alien civilisations in our galaxy.
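For reference, one common way of writing the Drake equation is shown below. There are several variants, and the symbols are the conventional ones rather than anything taken from the video.

$N = R_* \times f_p \times n_e \times f_l \times f_i \times f_c \times L$

Here N is the expected number of detectable civilisations in our galaxy; R_* is the rate at which stars form; f_p is the fraction of stars with planets; n_e is the number of planets per such star that could support life; f_l, f_i and f_c are the fractions of those that go on to develop life, intelligent life and detectable communication respectively; and L is the length of time for which such civilisations remain detectable. Some versions start from the number of stars in the galaxy rather than a rate of star formation, but the structure, a simple product of factors, is the same.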

Though there’s a lot of uncertainty about the numbers that feed into Drake’s equation, best estimates lead to an answer that suggests there should be millions of civilisations out there somewhere. And Fermi’s paradox points to the contrast between this number and the zero civilisations that we’ve actually observed.

Anyway, rather than try to go through any of this in greater detail, I thought I’d let this video do the explaining. And for fun, they suggest using the same technique to calculate the chances of you finding someone you are compatible with as a love partner.

Now, you probably don’t need me to explain all the limitations in this methodology, either for the evidence of alien life or for potential love partners with whom you are compatible. Though of course, the application to finding love partners is just for fun, right?

Well, yes and no. Here’s Rachel Riley of Countdown fame doing a barely-disguised piece of publicity for eHarmony.

She uses pretty much the same methodology to show that you have…

… a 1 in 562 chance of finding love.

… get to know your colleagues

<Smartodds!!! I know!!!>

But it’s maybe not as bad as it sounds; she’s suggesting your colleagues might have suitable friends for you to pair up with, rather than your colleagues being potential love-partners themselves.

Finally, I’ll let you think about whether the methodology and assumptions used in Rachel’s calculations make sense or not. And maybe even try to understand what the 1 in 562 answer actually means, especially as a much higher proportion of people actually do end up in relationships. The opposite of Fermi’s paradox!

# By coincidence

In an earlier post I suggested we play a game. You’d pick a sequence of three outcomes of a coin toss, like THT. Then I’d pick a different triplet, say TTT. I’d then toss a coin repeatedly and whoever’s chosen triplet showed up first in the sequence would be the winner.

In the post I gave the following example…

H T H H H T T H T H …..

… and with that outcome and the choices above you’d have won since your selection of THT shows up starting on the 7th coin toss, without my selection of TTT showing up before.

The question I asked was who this game favoured. Assuming we both play as well as possible, does it favour

1. neither of us, because we both get to choose and the game is symmetric? Or;
2. you, because you get to choose first and have the opportunity to do so optimally? Or;
3. me, because I get to see your choice before I have to make mine?

The answer turns out to be 3. I have a big advantage over you, if I play smart. We’ll discuss what that means in a moment.

But in terms of these possible answers, it couldn’t have been 2. Whatever you choose I could have chosen the exact opposite and by symmetry, since H and T are equally likely, our two triplets would have been equally likely to occur first in the sequence. So, if you choose TTT, I choose HHH. If you choose HHT, I choose TTH and so on. In this way I don’t have an advantage over you, but neither do you over me. So we can rule out 2 as the possible answer.

But I can actually play better than that and have an advantage over you, whatever choice you make. I play as follows:

1. My first choice in the sequence is the opposite of your second choice.
2. My second and third choices are equal to your first and second choices.

So, if you chose TTT, I would choose HTT. If you chose THT, I would choose TTH. And so on. It’s not immediately obvious why this should give me an advantage, but it does. And it does so for every choice you can make.
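The rule is simple enough to write down in a couple of lines of code. Here’s a small Python sketch; the names `FLIP` and `my_response` are just mine, for illustration.

```python
FLIP = {"H": "T", "T": "H"}

def my_response(your_choice):
    """Opposite of your second toss, followed by your first and second tosses."""
    return FLIP[your_choice[1]] + your_choice[:2]

for your_choice in ["HHH", "HHT", "HTH", "HTT", "THH", "THT", "TTH", "TTT"]:
    print(your_choice, "->", my_response(your_choice))
```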

The complete set of selections you can make, the selections I will make in response, and the corresponding odds in my favour are given in the following table.

| Your choice | My choice | My win odds |
| --- | --- | --- |
| HHH | THH | 7:1 |
| HHT | THH | 3:1 |
| HTH | HHT | 2:1 |
| HTT | HHT | 2:1 |
| THH | TTH | 2:1 |
| THT | TTH | 2:1 |
| TTH | HTT | 3:1 |
| TTT | HTT | 7:1 |

As you can see, your best choice is to go for any of HTH, HTT, THH, THT, but even then the odds are 2:1 in my favour. That’s to say, I’m twice as likely to win as you in those circumstances. My odds increase to 3:1 – I’ll win three times as often as you – if you choose HHT or TTH; and my odds are a massive 7:1 – I’ll win seven times as often as you – if you choose HHH or TTT.

So, even if you play optimally, I’ll win twice as often as you. But why should this be so? The probabilities aren’t difficult to calculate, but most are a little more complicated than I can reasonably include here. Let’s take the easiest example though. Suppose you choose HHH, in which case I choose THH. I then start tossing the coins. It’s possible that HHH will be the first 3 coins in the sequence. That will happen with probability 1/2 x 1/2 x 1/2 = 1/8. But if that doesn’t happen, then there’s no way you can win. Because the first time HHH appears in the sequence it will have had to have been preceded by a T (otherwise HHH has occurred earlier). In which case my THH occurs before your HHH. So, you would have won with probability 1/8, and therefore I win with probability 7/8, and my odds of winning are 7:1.

Like I say, the other cases – except when you choose TTT, which is identical to HHH, modulo switching H’s and T’s – are a little more complicated, but the principle is essentially the same every time.
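If you’d rather check the odds than take my word for them, one way to do so exactly, without simulation, is a formula due to John Conway, sometimes called the leading-number algorithm. That’s my choice of technique for this sketch rather than the calculation described above, and the function names are made up for illustration. For two triplets X and Y, the leading number adds up 2^(k-1) for every k at which the last k tosses of X match the first k tosses of Y.

```python
from fractions import Fraction

def leading_number(x, y):
    """Sum 2**(k-1) over every k for which the last k tosses of x match the first k of y."""
    return sum(2 ** (k - 1) for k in range(1, len(x) + 1) if x[-k:] == y[:k])

def odds_in_my_favour(your_choice, my_choice):
    """Odds (my wins : your wins) that my_choice appears before your_choice."""
    mine = leading_number(your_choice, your_choice) - leading_number(your_choice, my_choice)
    yours = leading_number(my_choice, my_choice) - leading_number(my_choice, your_choice)
    return Fraction(mine, yours)

# The eight (your choice, my choice) pairs from the table of selections
pairs = [("HHH", "THH"), ("HHT", "THH"), ("HTH", "HHT"), ("HTT", "HHT"),
         ("THH", "TTH"), ("THT", "TTH"), ("TTH", "HTT"), ("TTT", "HTT")]
for your_choice, my_choice in pairs:
    print(your_choice, my_choice, f"{odds_in_my_favour(your_choice, my_choice)}:1")
```

For HHH against THH, for example, the two differences are 7 and 1, giving the 7:1 odds worked out by hand.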

By coincidence, this game was invented by Walter Penney (Penney -> penny, geddit?), who published it in the Journal of Recreational Mathematics in 1969. It’s interesting from a mathematical/statistical/game-theory point of view because it’s an example of a non-transitive game. For example, looking at the table above, HHT is inferior to THH; THH is inferior to TTH; TTH is inferior to HTT; and HTT is inferior to HHT. Which brings us full circle. So, there’s no overall optimal selection. Each can be beaten by another, which in turn can be beaten by a different choice again. This is why the second player has an advantage: they can always find a selection that will beat the first player’s. It doesn’t matter that their choice can also be beaten, because the first player has already made their selection and it wasn’t that one.

The best known example of a non-transitive game is Rock-Paper-Scissors. Rock beats scissors; scissors beats paper; paper beats rock. But in that case it’s completely deterministic – rock will always beat scissors, for example. In the Penney coin tossing game, HTT will usually beat TTT, but occasionally it won’t. So, it’s perhaps better defined as a random non-transitive game.

The game also has connections with genetics. Strands of DNA are long chains composed of sequences of individual molecules known as nucleotides. Only four different types of nucleotide are found in DNA strands, and these are usually labelled A, T, G and C. The precise ordering of these nucleotides in the DNA strand effectively defines a code that will determine the characteristics of the individual having that DNA.

It’s not too much of a stretch of the imagination to see that a long sequence of nucleotides A, T, G and C is not so very different – albeit with 4 variants, rather than 2 – from a sequence of outcomes just like those from the tosses of a coin. Knowing which combinations of the nucleotides are more likely to occur than others, and other combinatorial aspects of their arrangements, proves to be an important contribution of Statistics to the understanding of genetics and the development of genetic intervention therapies. Admittedly, it’s not a direct application of Penney’s game, but the statistical objectives and techniques required are not so very different.

Thanks to those of you who wrote to me with solutions to this problem, all of which were at least partially correct.

# An Uberlord from the world of football

Ok, this isn’t strictly Statistics, but I came across it while researching for another post and it seemed fun, so I thought I’d share it.

JiGaZo is a 300-piece jigsaw puzzle with a difference. The pieces are identically-shaped, 90-degree rotationally symmetric, sepia-coloured (to different degrees of shading) and have a colour-coded symbol on the back. You take a picture of yourself or anyone else. You upload that picture to your computer, run it through the software provided, and it spits out a grid of the codes on the back of the jigsaw pieces. You then construct the jigsaw following those coded instructions, and the result is a jigsaw reproduction of the image you uploaded.

Perhaps it’s better explained by the following ad:

You can get JiGaZo for around a tenner at Amazon. Just picture your loved one’s face on Christmas Day when they realise that not only have you constructed a 300-piece jigsaw of them as a present, but they can also disassemble it and return the favour to you in time for Boxing Day.

Christmas shopping ideas: just one of the services provided by Smartodds loves Statistics.

Now, to give this post some relevance to Statistics, we might ask how many unique images can be made with a JiGaZo set? These are the types of calculations we often have to make when enumerating probabilities in all sorts of statistical problems.

Have a quick guess at what the answer might be before scrolling down…

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

So, first we can calculate the number of ways the 300 pieces can be placed in the grid. There are 300 choices for the first piece; that leaves 299 for the second; then 298 for the third; and so on. Since each of these choices combines with each of the others, the total number of such arrangements is

$300 \times 299 \times 298 \times .... \times 3 \times 2 \times 1$

By convention this is written 300! and it’s HUGE: approximately the number 3 followed by 614 zeros.

To put that in perspective, it’s believed the number of stars in the observable universe is ‘only’ 1 followed by 21 zeros. So, if every star had its own universe, and every star in that universe had its own universe, and every star in that universe had its own universe and we kept doing that – putting a universe on every new star – a total of around 29 times, then we’d have about 300! stars in total.

But that’s not all. For every one of those 300! arrangements of the JiGaZo pieces, each piece can be rotated in 4 different ways. So we have to multiply 300! by 4 three hundred times. Finally we divide that answer by 2 since any arrangement is arguably the same if it’s upside-down – I can just rotate it and get the same picture. That gives a total of

$300! \times 4^{300} /2$,

which is roughly 6 followed by 794 zeros. So, apply that universe on every star procedure another 8 times or so, and you get close to the number of unique JiGaZo images.
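Numbers this big are awkward on a calculator, but Python handles whole numbers of any size exactly, so the counts can be checked directly. This is just a quick sketch, and the helper name `leading_digit_and_zeros` is my own.

```python
import math

arrangements = math.factorial(300)        # ways to place the 300 pieces in the grid
total = arrangements * 4 ** 300 // 2      # 4 rotations per piece, halved for upside-down copies

def leading_digit_and_zeros(n):
    """Describe n as 'digit d followed by z zeros', returning (d, z)."""
    s = str(n)
    return int(s[0]), len(s) - 1

print(leading_digit_and_zeros(arrangements))  # (3, 614)
print(leading_digit_and_zeros(total))         # (6, 794)
```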

Since even the fastest computer in the world would take much longer than the age of the universe to run through all of those possibilities, you start to realise that the software that comes with JiGaZo, which aims to find a pretty good match for any input, must be a smart piece of image mapping.

But crucially… how well does it work? I guess the answer to that is determined by how easily you can recognise the following Uberlord from the world of football…

# Favouritism

Let’s play a game. I’ve got a coin here, and I’m going to toss it repeatedly and record the sequence of outcomes: heads (H) and tails (T).

Here we go…

H T H H H T T H T H …..

That was fun.

Next, I’ll do that again, but before doing so I’ll ask you to make a prediction for a sequence of 3 tosses. Then I’ll do the same, making a different choice. So you might choose THT and then I might choose TTT. I’ll then start tossing the coin. The winner will be the person whose chosen triplet shows up first in the sequence of tosses.

So, if the coin showed up as in the sequence above, you’d have won because there’s a sequence of THT starting from the 7th toss in the sequence. If the triplet TTT had shown up before that – which it didn’t – then I’d have won.
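In case the rules aren’t clear, here’s a small Python sketch that decides the winner for a given run of tosses. The function name `first_to_appear` is made up for illustration, and the tosses are written as a string of H’s and T’s.

```python
def first_to_appear(tosses, yours, mine):
    """Return 'you', 'me' or None according to whose triplet shows up first."""
    for start in range(len(tosses) - 2):
        window = tosses[start:start + 3]
        if window == yours:
            return "you"
        if window == mine:
            return "me"
    return None  # neither triplet has appeared yet

print(first_to_appear("HTHHHTTHTH", "THT", "TTT"))  # you
```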

Now, assuming we both play optimally, there are 3 possibilities for who this game might favour (in the sense of having a higher probability of winning):

1. It favours no one. We both get to choose our preferred sequence and so, by symmetry, our chances of winning are equal.
2. It favours you. You get to choose first and so you can make the optimal choice before I get a chance.
3. It favours me. I get to see your choice before making mine and can make an optimal choice accordingly.

Which of these do you think is correct? Have a think about it. You might even have to decide what it means to play ‘optimally’.

If you’d like to mail me with your answers I’d be happy to hear from you. In a subsequent post I’ll discuss the solution with reasons why this game is important.

# Faking it

Take a look at the following table:

It shows the total land area, in square kilometres, for various countries. Actually, it’s the first part of a longer alphabetical list of all countries and includes two columns of figures, each purporting to be the corresponding area of each country. But one of these columns contains the real areas and the other one is fake. Which is which?

Clearly, if your knowledge of geography is good enough that you know the land area of Belgium – or any of the other countries in the table – or whether Bahrain is bigger than Barbados, then you will know the answer. You could also cheat and check with Google. But you can answer the question, and be almost certain of being correct, without cheating and without knowing anything about geography. Indeed, I could have removed the first column giving the country names, and even not told you that the data correspond to land areas, and you should still have been able to tell me which column is real and which is fake.

So, which column is faking it? And how do you know?

I’ll write a follow-up post giving the answer and explanation sometime soon. Meantime, if you’d like to write to me giving your own version, I’d be happy to hear from you.