# Animal experiments

Ever thought your cat might be trolling you? Turns out you’re right. As explained in this New Scientist article, recent Japanese research concludes that cats are entirely capable of recognising their names; they just choose not to when it suits them.

The full details of the experiment are included in a research report published in Nature. It’s an interesting, though not entirely easy, read. But I’d like to use it to point out an aspect of statistical methodology that is often ignored: statistical analyses don’t usually start with the analysis of data; they start with the design of the experiment by which the data are to be collected. And it’s essential that an experiment is designed correctly in order to be able to use Statistics to answer the question you’re interested in.

So, in this particular study, the researchers carried out four separate experiments:

• In experiment 1, the ability of cats to distinguish their own names from that of other similar nouns was tested;
• In experiment 2, cats living with numerous other cats were tested to see if they could distinguish their own name from that of other cats in the same household;
• Experiment 3 was like experiment 1, but using cats from a ‘cat cafe‘ (don’t ask) rather than a normal household;
• Experiment 4 was also like experiment 1, but using a voice other than the cat’s owner to trigger the responses.

Through this sequence of experiments, the researchers were able to judge whether or not the cats genuinely recognise and respond to their own names in a variety of environments, and to exclude the possibility that the responses were due to factors other than actual name recognition. As such, this is a great example of how the design of an experiment has been carefully tailored to ensure that a statistical analysis of the data it generates is able to answer the question of interest.

I won’t go into details, but there are many other aspects of the experimental design that also required careful specification:

1. The number of cats to be included in the study;
2. The choice of words to use as alternative stimuli to the cats’ names, and the order in which they are used;
3. The definitions of actions that are considered positive responses to stimuli;
4. The protocol for determining whether a cat has responded positively to a stimuli or not;

amongst others. Full details are available in the Nature article, as indeed are the data, should you wish to analyse them yourself.

In the context of sports modelling, these kinds of issues are less explicit, since analyses are usually retrospective, using data that have already been historically collected and stored. Nonetheless, the selection of which data to include in an analysis can affect the analysis, and it’s important to ensure that results are not sensitive to specific, subjective choices. However, for analyses of data that include a decision process – such as betting strategies – it may well be relevant to formulate an experimental design for a prospective study, comparing results based on one type of strategy, compared with that of another. We’ll discuss strategies for this type of experiment in a future post.

# Altered images

In a recent post I described the following problem which I encountered while sitting in a dentist waiting room:

Images are randomly selected from a library of images and shown on a screen. After watching the screen for a while, I notice one or more of the images is a repeat showing of an earlier image. How can I use information on the number of images observed and the number of repeats to estimate how many images there are in the entire library?

I had two great replies suggesting solutions to this problem. The first was from Nity.Raj@smartodds.co.uk

Surely the efficient thing to do is to hack the database of images so you just find out how many there are in fact, rather than estimating?

It’s the perfect answer, but I just need to run it past someone with a legal background who’s connected to Smartodds to check it’s compliant with relevant internet communication laws. Can anyone suggest somebody suitable?

The other idea was from Ian.Rutherford@smartbapps.co.uk who suggested this:

I would take the total of all the images seen and divide it by the number of times I spotted the 23 to Leigh Park to give an estimation of the number of different images

You’ll have to read the original post to understand the ’23 to Leigh Park’ bit of this answer, but you can take it as a reference to any one of the images that you’ve seen. So, let’s suppose I’ve seen 100 images, and I’ve seen one particular image that I’m interested in 4 times. Then Ian’s suggestion is to estimate the total number of images as

$100/4=25$

Ian didn’t explain his answer, so I hope I’m not doing him a disservice, but I think the reasoning for this solution is as follows. Suppose the population size is N and I observe v images. Then since the images occur at random, the probability I will see any particular image when a random image is shown is 1/N. So the average, or expected, number of times I will see a particular image in a sequence of v images is v/N. If I end up seeing the image t times, this means I should estimate v/N with t. But rearranging this, it means I estimate N with v/t.

It’s a really smart answer, but I think there are two slight drawbacks.

1. Suppose, in the sequence of 100 images, I’d already seen 26 (or more) different images. In that case I’d know the estimate of 25 was bound to be an under-estimate.
2. This estimate uses information based on the number of repeats of just one image. Clearly, the number of repeats of each of the different images I observe is equally relevant, and it must be wasteful not to use the information they contain as well.

That said, the simplicity and logic of the answer are both extremely appealing.

But before receiving these answers, and actually while waiting at the dentist, I had my own idea. I’m not sure it’s better than Nity’s or Ian’s, and it has its own drawbacks. But it tells a nice story of how methods from one area of Statistics can be relevant for something apparently unrelated.

So, imagine you’re an ecologist and there’s concern that pollution levels have led to a reduction in the number of fish in a lake. To assess this possibility you need to get an estimate of how many fish there are in the lake.  The lake is large and deep, so surface observations are not useful. And you don’t have equipment to make sub-surface measurements.

What are you going to do?

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|

One standard statistical approach to this problem is a technique called mark and recapture. There are many variations on this method, some quite sophisticated, but we’ll discuss just the simplest, which works as follows.

A number of fish are caught (unharmed), marked and released back into the lake. Let this number of fish be n, say.

Some time later, a second sample of fish – let’s say of size K – is taken from the lake. We observe that k fish of this second sample have the mark that we applied in the first sample. So k/K is the proportion of fish in the second sample that have been marked. But since this is just a random sample from the lake, we’d expect this proportion to be similar to the proportion of marked fish in the entire lake, which will be n/N.

Expressing this mathematically, we have an approximation

$k/K \approx n/N$

But we can rearrange this to get:

$N \approx nK/k$

In other words, we could use

$\hat{N}= nK/k$

as an estimate for the number of fish, since we’d expect this to be a reasonable approximation to the actual number N.

So, let’s suppose I originally caught, marked and released 100 fish. I subsequently catch a further 50 fish, of which 5 are marked. Then, n=100, K=50, k=5 and so

$\hat{N} = nK/k = 100 \times 50 /5 =1000$

and I’d estimate that the lake contains 1000 fish.

Now, maybe you can see where this is going. Suppose instead of a lake of fish, we have a library of images. This method would allow me to estimate the size of the population of images, just as it does a population of fish. But there’s a slight catch (if you’ll pardon the pun). When I take a sample of fish from a lake, each of the fish in the sample is unique. But when I look at a selection of images at the dentist, some of them may be repeats. So I can’t quite treat my sample of images in exactly the same way as I would a sample of fish. To get round this problem I have to ignore the repeated images within each sample. So, my strategy is this:

1. Observe a number of the images, ignoring any repeats. Call the number of unique images n.
2. Observe a second set of images. Let the number of unique images in this set be K, but keeping count of repeats with the first set. Let’s say the number of repeats with the first sample is k.

The estimate of the population size – for the same reasons as estimating fish population sizes – is then

$\hat{N} = nK/k$.

So, suppose I chose to look at images for 10 minutes. In that period there were 85 images, but 5 of these were repeats. So, n=80. I then watch for another 5 minutes and observe 30 unique images, 4 of which were also observed in the first sample. So, n=80, K=30, m=4 and my estimate of the number of images in the database is

$\hat{N} = nK/k = 80 \times 30 /4 =600$

Is this answer any better than Ian’s? I believe it uses more information available in the data, since it doesn’t focus on just one image. It’s also less likely to give an answer that is inconsistent with the data that I’ve already seen. But it does have drawbacks and limitations:

1. Ignoring the information on repeats within each sample must also be wasteful of relevant information.
2. The distinction between the first sample and second sample is arbitrary, and it might be that different choices lead to different answers.
3. Keeping track of repeats within and across the two samples might be difficult in practice.

In a subsequent post I’ll do a more detailed study of the performance of the two methods. In the meantime, let me summarise what I think are the main points from this discussion:

1. Statistical problems can occur in the most surprising places
2. There’s usually no right or wrong way of tackling a statistical problem. One approach might be best from one point of view, while another is better from a different point of view.
3. Statistics is a very connected subject: a technique that has been developed for one type of problem may be transferable to a completely different type of problem.
4. Simple answers are not always be the best – though sometimes they are – but simplicity is a virtue in itself.

Having said all that, there are various conventional ways of judging the performance of a statistical procedure, and I’ll use some of these to compare my solution with Iain’s in the follow-up post. Meantime, I’d still be happy to receive alternative solutions to the problem, whose performance I can also compare against mine and Ian’s.