Don't look for meanings, count mentions

Note: This article was originally published at Planet PHP on 7 October 2011.
It's Ada Lovelace Day, giving me a (not often needed) excuse to talk about one of the most interesting people to have worked in information retrieval, Professor Karen Spärck Jones. She worked at the University of Cambridge almost up until her death in 2007, and made significant contributions to natural language processing, machine translation, and particularly to search.

In my eyes at least, her most significant contribution was IDF term weighting. IDF stands for inverse document frequency, and measures how rare a given word is in the collection of documents being indexed by a search system. This turns out to be a really solid factor for calculating search rankings - more common words are generally less indicative of good matches between a search query and a document than rare words - and it is used in the basic ranking algorithm for simple vector space search, TF-IDF (I blogged about that a while back).

Professor Spärck Jones' later research included work on the probabilistic model of information retrieval, which incorporated the idea that a document X should be ranked by the probability that it is relevant to query Y. This provided a strong theoretical grounding for information retrieval, and led to the development of the types of ranking algorithms that regularly outperform the competition in trials and real world usage.

Some of the most interesting work towards the end of her career was around the problem of summarising text: the ability to extract the core 'message' of a given document and express it in a significantly shorter form. Like most of her work, summarisation reflects the power of statistical approaches, and just how far you can go in natural language processing without relying on complex, and rigid, language modelling (hence the quote I used for the title of this post).

However, one of the things I, and I'm sure many others interested in information retrieval, appreciate is the clarity and wit in her writing. For researchers who work with natural language every day, it is surprisingly common to publish papers and presentations which are extremely dry and impenetrable, and Karen Spärck Jones was neither.
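
For anyone who wants to see the IDF idea from earlier in concrete terms, here is a minimal Python sketch. It is only an illustration under a few assumptions: the function names (`inverse_document_frequency`, `tf_idf_score`) and the toy documents are made up for this example, tokenisation is a plain whitespace split, and the weighting is the simple log(N / df) form, which is just one of several variants in use.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Return {term: idf} for a list of tokenised documents.

    Assumption: idf = log(N / df), where N is the number of documents
    and df is the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf_score(query, doc, idf):
    """Sum term frequency * idf over the query terms (unnormalised)."""
    tf = Counter(doc)
    return sum(tf[term] * idf.get(term, 0.0) for term in query)


# Hypothetical toy collection, tokenised by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the search engine ranked the documents".split(),
]

idf = inverse_document_frequency(docs)
query = ["the", "search"]
scores = [tf_idf_score(query, doc, idf) for doc in docs]
best = max(range(len(docs)), key=lambda i: scores[i])
```

Because "the" appears in every document, its weight is log(3 / 3) = 0, so it contributes nothing to the ranking no matter how many times it is mentioned. "search" appears in only one document, so it carries the whole score and the third document comes out on top.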