In a recent post I described the following problem which I encountered while sitting in a dentist waiting room:
Images are randomly selected from a library of images and shown on a screen. After watching the screen for a while, I notice one or more of the images is a repeat showing of an earlier image. How can I use information on the number of images observed and the number of repeats to estimate how many images there are in the entire library?
I had two great replies suggesting solutions to this problem. The first was from Nity.Raj@smartodds.co.uk
Surely the efficient thing to do is to hack the database of images so you just find out how many there are in fact, rather than estimating?
It’s the perfect answer, but I just need to run it past someone with a legal background who’s connected to Smartodds to check it’s compliant with relevant internet communication laws. Can anyone suggest somebody suitable?
The other idea was from Ian.Rutherford@smartbapps.co.uk who suggested this:
I would take the total of all the images seen and divide it by the number of times I spotted the 23 to Leigh Park to give an estimation of the number of different images
You’ll have to read the original post to understand the ’23 to Leigh Park’ bit of this answer, but you can take it as a reference to any one of the images that you’ve seen. So, let’s suppose I’ve seen 100 images, and I’ve seen one particular image that I’m interested in 4 times. Then Ian’s suggestion is to estimate the total number of images as
Ian didn’t explain his answer, so I hope I’m not doing him a disservice, but I think the reasoning for this solution is as follows. Suppose the population size is N and I observe v images. Then since the images occur at random, the probability I will see any particular image when a random image is shown is 1/N. So the average, or expected, number of times I will see a particular image in a sequence of v images is v/N. If I end up seeing the image t times, this means I should estimate v/N with t. But rearranging this, it means I estimate N with v/t.
It’s a really smart answer, but I think there are two slight drawbacks.
- Suppose, in the sequence of 100 images, I’d already seen 26 (or more) different images. In that case I’d know the estimate of 25 was bound to be an under-estimate.
- This estimate uses information based on the number of repeats of just one image. Clearly, the number of repeats of each of the different images I observe is equally relevant, and it must be wasteful not to use the information they contain as well.
That said, the simplicity and logic of the answer are both extremely appealing.
But before receiving these answers, and actually while waiting at the dentist, I had my own idea. I’m not sure it’s better than Nity’s or Ian’s, and it has its own drawbacks. But it tells a nice story of how methods from one area of Statistics can be relevant for something apparently unrelated.
So, imagine you’re an ecologist and there’s concern that pollution levels have led to a reduction in the number of fish in a lake. To assess this possibility you need to get an estimate of how many fish there are in the lake. The lake is large and deep, so surface observations are not useful. And you don’t have equipment to make sub-surface measurements.
What are you going to do?
Have a think about this before scrolling down.
One standard statistical approach to this problem is a technique called mark and recapture. There are many variations on this method, some quite sophisticated, but we’ll discuss just the simplest, which works as follows.
A number of fish are caught (unharmed), marked and released back into the lake. Let this number of fish be n, say.
Some time later, a second sample of fish – let’s say of size K – is taken from the lake. We observe that k fish of this second sample have the mark that we applied in the first sample. So k/K is the proportion of fish in the second sample that have been marked. But since this is just a random sample from the lake, we’d expect this proportion to be similar to the proportion of marked fish in the entire lake, which will be n/N.
Expressing this mathematically, we have an approximation
But we can rearrange this to get:
In other words, we could use
as an estimate for the number of fish, since we’d expect this to be a reasonable approximation to the actual number N.
So, let’s suppose I originally caught, marked and released 100 fish. I subsequently catch a further 50 fish, of which 5 are marked. Then, n=100, K=50, k=5 and so
and I’d estimate that the lake contains 1000 fish.
Now, maybe you can see where this is going. Suppose instead of a lake of fish, we have a library of images. This method would allow me to estimate the size of the population of images, just as it does a population of fish. But there’s a slight catch (if you’ll pardon the pun). When I take a sample of fish from a lake, each of the fish in the sample is unique. But when I look at a selection of images at the dentist, some of them may be repeats. So I can’t quite treat my sample of images in exactly the same way as I would a sample of fish. To get round this problem I have to ignore the repeated images within each sample. So, my strategy is this:
- Observe a number of the images, ignoring any repeats. Call the number of unique images n.
- Observe a second set of images. Let the number of unique images in this set be K, but keeping count of repeats with the first set. Let’s say the number of repeats with the first sample is k.
The estimate of the population size – for the same reasons as estimating fish population sizes – is then
So, suppose I chose to look at images for 10 minutes. In that period there were 85 images, but 5 of these were repeats. So, n=80. I then watch for another 5 minutes and observe 30 unique images, 4 of which were also observed in the first sample. So, n=80, K=30, m=4 and my estimate of the number of images in the database is
Is this answer any better than Ian’s? I believe it uses more information available in the data, since it doesn’t focus on just one image. It’s also less likely to give an answer that is inconsistent with the data that I’ve already seen. But it does have drawbacks and limitations:
- Ignoring the information on repeats within each sample must also be wasteful of relevant information.
- The distinction between the first sample and second sample is arbitrary, and it might be that different choices lead to different answers.
- Keeping track of repeats within and across the two samples might be difficult in practice.
In a subsequent post I’ll do a more detailed study of the performance of the two methods. In the meantime, let me summarise what I think are the main points from this discussion:
- Statistical problems can occur in the most surprising places
- There’s usually no right or wrong way of tackling a statistical problem. One approach might be best from one point of view, while another is better from a different point of view.
- Statistics is a very connected subject: a technique that has been developed for one type of problem may be transferable to a completely different type of problem.
- Simple answers are not always be the best – though sometimes they are – but simplicity is a virtue in itself.
Having said all that, there are various conventional ways of judging the performance of a statistical procedure, and I’ll use some of these to compare my solution with Iain’s in the follow-up post. Meantime, I’d still be happy to receive alternative solutions to the problem, whose performance I can also compare against mine and Ian’s.