News Archive
PhpRiot Newsletter
Your Email Address:

More information

Expecting the unexpected with Good-Turing

Note: This article was originally published at Planet PHP on 29 October 2011.
Planet PHP

A lot of interesting techniques involve taking statistical samples, and using those to predict what we'll see in the future. Usually this works pretty well, but when we're dealing with a lot of options or if we have some options that are very rare that approach can go pretty wrong. If we go down the street and note down how many men and women we see, we'll probably be able to use that to predict the chance of the next person we see being male or female pretty well. However, if we were counting all the species of animals we encounter, and trying to use that to predict what we'll see in the future, we'd likely run in to a couple of problems.

One, we wouldn't have any way of reserving any of our expectation for seeing new animals, but it is totally possible that we will encounter an animal we hadn't seen in our first sample as there are a lot of different types. Two, we'll probably over-estimate the probability of seeing a more unusual type of animals if we did happen to encounter one in our first sample. The same problems exist if we try and calculate the probability of seeing different words we have found from a certain set of documents, which can be a problem in search and machine learning.

Alan Turing and IJ Good came up with some useful methods for estimating these kind of rare and unseen events. This includes a measure for how much of the population is covered by the sample: 1 - (n1 / N). N being the number of observations in the sample, and n1 is the number of types that have only been seen once.

More usefully for us, we can calculate how often we should expect to see new types. Concretely, lets look at the words in some documents. We'll write a small function that tokenises a document and counts up all the times each word appears:

function generate_stats($url) {
A A A A $text = file_get_contents($url);
A A A A preg_match_all('/[a-zA-Z]+/', $text, $matches);
A A A A $words = array();
A A A A foreach($matches[0] as $match) {
A A A A A A A A $match = strtolower($match);
A A A A A A A A if(!isset($words[$match])) {
A A A A A A A A A A A A $words[$match] = 0;
A A A A A A A A }
A A A A A A A A $words[$match]++;
A A A A }
A A A A asort($words);
A A A A return $words;

We can then calculate the proportion of new words we should expect in another document, as the number of words that appeared once divided by the total number of different words.

$proportion =

Truncated by Planet PHP, read more at the original (another 14891 bytes)