Creating A Fulltext Search Engine In PHP 5 With The Zend Framework's Zend Search Lucene
How Fulltext Indexing And Query Works
Just in case you’ve never dealt with the subject before, I will briefly cover how the whole thing works.
The index essentially stores a special copy of all your data, but instead of storing the actual content of your documents, it stores a list of keywords found in each document.
So if I were to create an index of all the documents on PhpRiot, this would be done by looping over every document, finding its keywords and then adding those keywords to the index.
Additionally, since we can’t just list a bunch of keywords to a user when they perform a search, we also store some other data in the index along with the keywords. In the case of PhpRiot, we will store the document title, the author, the document URL and a brief summary of the article.
Alternatively, we could store just the document ID and then look up the other details from the database using that ID. However, it is quicker to store the pertinent data in the index (fewer database queries), and since the title and author are only a few words for each document, there is little overhead in duplicating this data in our index.
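To make the idea concrete, here is a minimal sketch of such an index in plain PHP. It is only an illustration of the concept, not how Zend_Search_Lucene stores its data: the helper names (extractKeywords, addToIndex), the array layout and the sample documents are all made up for this example.

```php
<?php
// Toy in-memory index, for illustration only.
// 'postings' maps each keyword to the IDs of the documents containing it;
// 'stored' holds the display data we want to show in search results.

// Hypothetical helper: split text into unique lowercase keywords.
function extractKeywords($text)
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

// Hypothetical helper: index the body's keywords and keep the display
// fields, but do not keep the body text itself.
function addToIndex(array &$index, $id, array $stored, $body)
{
    $index['stored'][$id] = $stored;
    foreach (extractKeywords($body) as $word) {
        $index['postings'][$word][] = $id;
    }
}

$index = ['postings' => [], 'stored' => []];

// Made-up sample documents.
addToIndex($index, 1, [
    'title'   => 'Intro to PHP',
    'author'  => 'Alice',
    'url'     => '/articles/1',
    'summary' => 'Getting started.',
], 'PHP is a scripting language');

addToIndex($index, 2, [
    'title'   => 'Searching with Lucene',
    'author'  => 'Bob',
    'url'     => '/articles/2',
    'summary' => 'Building an index.',
], 'Searching with a Lucene index in PHP');

echo implode(', ', $index['postings']['php']), "\n"; // prints "1, 2"
```

Because the title, author, URL and summary sit in the index next to the document ID, a results page can be rendered from the index alone, which is the trade-off described above.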
Querying the data
Once we have created the index, we can then allow people to search it. When somebody types in search terms, we look those terms up in our index, and each term points back to zero or more documents. Zend_Search_Lucene will also score the results for us, meaning the most relevant matches are returned first.
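The lookup-and-score step can be sketched in plain PHP as follows. This is a toy: the search function and the sample data are invented for illustration, and the score is simply the number of times the query terms occur in each document, which is an assumption of this sketch and much cruder than the scoring Zend_Search_Lucene performs.

```php
<?php
// Toy postings list: keyword => [document ID => number of occurrences].
// The data is made up for this example.
$postings = [
    'php'    => [1 => 3, 2 => 1],
    'lucene' => [2 => 4, 3 => 1],
    'search' => [3 => 2],
];

// Hypothetical helper: return [document ID => score] for every document
// matching at least one query term, highest score first.
function search(array $postings, $query)
{
    $scores = [];
    $terms  = preg_split('/[^a-z0-9]+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($terms as $term) {
        if (!isset($postings[$term])) {
            continue; // a term missing from the index matches no documents
        }
        foreach ($postings[$term] as $docId => $count) {
            $previous       = isset($scores[$docId]) ? $scores[$docId] : 0;
            $scores[$docId] = $previous + $count;
        }
    }

    arsort($scores); // sort by score, descending, keeping document IDs as keys
    return $scores;
}

foreach (search($postings, 'PHP Lucene') as $docId => $score) {
    echo "Document $docId (score $score)\n";
}
// prints:
// Document 2 (score 5)
// Document 1 (score 3)
// Document 3 (score 1)
```

A query whose terms are not in the index, such as search($postings, 'zend'), returns an empty array, which is the "zero documents" case.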
Keeping the index up-to-date
If a document is ever updated in the database, the index will likely be out-of-date. This adds an extra consideration to our general content management: we need to re-index that document. A Zend_Search_Lucene index has no in-place update, so the document must be removed from the index and then re-added.
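The remove-then-re-add pattern looks roughly like this on a toy keyword index in plain PHP. All function names and sample texts here are hypothetical and exist only for this sketch; with Zend_Search_Lucene itself, the same two steps would be carried out through the index object's delete() and addDocument() methods.

```php
<?php
// Toy index: keyword => list of document IDs. Illustration only.

// Hypothetical helper: split text into unique lowercase keywords.
function keywords($text)
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

function addToIndex(array &$index, $id, $body)
{
    foreach (keywords($body) as $word) {
        $index[$word][] = $id;
    }
}

// Remove every trace of a document, dropping keywords that end up empty.
function removeFromIndex(array &$index, $id)
{
    foreach (array_keys($index) as $word) {
        $index[$word] = array_values(array_diff($index[$word], [$id]));
        if (count($index[$word]) === 0) {
            unset($index[$word]);
        }
    }
}

// "Updating" a document is a removal followed by a fresh add.
function reindex(array &$index, $id, $newBody)
{
    removeFromIndex($index, $id);
    addToIndex($index, $id, $newBody);
}

$index = [];
addToIndex($index, 1, 'Old draft about sessions');
addToIndex($index, 2, 'Notes about cookies');

// Document 1 was edited in the database, so its index entry is stale.
reindex($index, 1, 'New article about cookies');

var_dump(isset($index['sessions']));             // bool(false)
echo implode(', ', $index['cookies']), "\n";     // prints "2, 1"
```

Removing first matters: if the new text were simply added, stale keywords such as "sessions" would keep pointing at document 1.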