Creating A Fulltext Search Engine In PHP 5 With The Zend Framework's Zend Search Lucene
There are several aspects of Zend_Search_Lucene that can be extended, allowing a fully customized search solution. These include:
- A custom tokenizer for determining keywords in a document
- Custom scoring algorithms to determine how well a document matches a search query
- A custom storage method, so your index is stored however and wherever you please
A custom tokenizer
There are many reasons why a custom tokenizer can be useful. Here are some ideas:
- PDF tokenizer – a tokenizer that can parse a PDF file and find all the keywords
- Image tokenizer – a tokenizer that can perform Optical Character Recognition (OCR), thereby allowing you to index words in an image (you could also store the image itself, using the Binary field type)
- HTML tokenizer – a tokenizer that can read HTML data, thereby knowing not to index HTML tags and attributes but only the actual content. You could make further improvements on this also, such as finding all headings and giving them higher preference than the rest of the content.
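To make the HTML idea more concrete, here is a minimal sketch in plain PHP of the text-extraction step such a tokenizer would need. It does not use any Zend Framework classes; the function names extractHtmlTokens() and wordsFromMarkup() are hypothetical, and the regular-expression approach is an assumption that is only adequate for reasonably well-formed markup.

```php
<?php
/**
 * Hypothetical helper: split an HTML string into lowercase word tokens,
 * keeping heading words separate so they can be weighted more heavily.
 */
function extractHtmlTokens($html)
{
    // Remove <script> and <style> blocks, since their contents are not page text
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Collect the inner markup of every <h1>..<h6> element
    preg_match_all('#<h[1-6]\b[^>]*>(.*?)</h[1-6]>#is', $html, $found);
    $headings = implode(' ', $found[1]);

    return array(
        'headings' => wordsFromMarkup($headings),
        'content'  => wordsFromMarkup($html),
    );
}

/**
 * Hypothetical helper: strip tags, decode entities, and return the words.
 */
function wordsFromMarkup($markup)
{
    // Replace each tag with a space so neighbouring words do not run together
    $text = preg_replace('/<[^>]*>/', ' ', $markup);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // A "word" here is any run of letters or digits (an assumption)
    preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches);

    // strtolower() only lowercases ASCII reliably; good enough for a sketch
    return array_map('strtolower', $matches[0]);
}

$tokens = extractHtmlTokens('<h1>Search Engines</h1><p>Indexing <b>HTML</b> content</p>');
// $tokens['headings'] is array('search', 'engines')
// $tokens['content']  is array('search', 'engines', 'indexing', 'html', 'content')
```

In a real tokenizer this logic would sit inside whatever class Zend_Search_Lucene expects you to extend; the heading words could, for example, be indexed in a field of their own so that matches there count for more.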
Custom scoring algorithms
Using a custom scoring algorithm, you can determine how favourably different fields in a document are looked upon. For example, you might want to treat matches in the ‘title’ field (if you have one) much more favourably than matches in the ‘content’ field.
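The effect of weighting fields differently can be illustrated with a small standalone sketch. This is not the formula Zend_Search_Lucene uses, and it does not touch the library at all; scoreDocument() and the weights below are hypothetical and simply count term hits per field, multiplied by a per-field weight.

```php
<?php
/**
 * Hypothetical field-weighted score: hits in each field times that field's weight.
 * A deliberate simplification, not Lucene's real scoring formula.
 */
function scoreDocument(array $fields, $term, array $weights)
{
    $term  = strtolower($term);
    $score = 0.0;

    foreach ($fields as $name => $text) {
        // Split the field into lowercase ASCII words and count each one
        preg_match_all('/[a-z0-9]+/', strtolower($text), $matches);
        $counts = array_count_values($matches[0]);

        $hits   = isset($counts[$term]) ? $counts[$term] : 0;
        $weight = isset($weights[$name]) ? $weights[$name] : 1.0; // unlisted fields count once

        $score += $hits * $weight;
    }

    return $score;
}

// Assumed weights: a title match is worth three content matches
$weights = array('title' => 3.0, 'content' => 1.0);

$docA = array('title' => 'Lucene search in PHP', 'content' => 'An introduction.');
$docB = array('title' => 'Introduction', 'content' => 'We use Lucene for indexing');

$scoreA = scoreDocument($docA, 'lucene', $weights); // 3.0: one title hit at weight 3.0
$scoreB = scoreDocument($docB, 'lucene', $weights); // 1.0: one content hit at weight 1.0
```

Both documents mention the term once, but the one with the match in its title ranks higher. A custom scoring algorithm in Zend_Search_Lucene would express the same preference through the library's own extension points rather than a free-standing function like this.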
Custom storage method
You can change how indexes are written to disk by extending the Zend_Search_Lucene_Storage_Directory and Zend_Search_Lucene_Storage_File classes.
It may be possible to change this to store all indexed data in a database, but I haven’t actually tried this, so I’m not sure.
More information on extending Zend_Search_Lucene can be found at http://framework.zend.com/manual/en/zend.search.lucene.extending.html.