Creating A Fulltext Search Engine In PHP 5 With The Zend Framework's Zend Search Lucene
Creating Our First Index
The basic process for creating an index is:
- Open the index
- Add each document
- Commit (save) the index
The index is stored on the filesystem, so when you open an index, you specify the path where it is to be stored. To create a new index, call the static create() method of the Zend_Search_Lucene class (an existing index is opened with the static open() method instead). The first and only argument to either method is the filesystem path of the index.
require_once('Zend/Search/Lucene.php');
$indexPath = '/var/www/phpriot.com/data/docindex';
$index = Zend_Search_Lucene::create($indexPath);
Adding a document to our index
Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:
$doc = new Zend_Search_Lucene_Document();
The next thing we must do is determine which fields we need to add to our index.
There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index; however, this would be overkill and is unnecessary. Duplicating smaller items such as the author name or document title is less of a concern, as it’s only a small amount of data.
As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:
- Document content – but we don’t want to duplicate the entire article, as we’re only going to show users a preview of the article in the search results
- Title – we’re definitely going to include the title in our results
- Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
- Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
- Created – We’ll also store a timestamp of when the article was created.
This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:
- Keyword – Data that is searchable and stored in the index, but not broken up into tokens for indexing. This is useful for being able to search on non-textual data such as IDs or URLs.
- UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
- UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
- Text – Data that is available for search and is stored in full (title and author)
There is also the Binary field available, but we won’t be using it in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
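To make the differences between the field types easier to compare, here is a minimal sketch that records each type's behaviour as a plain PHP array. The $fieldTypes array and the describeFieldType() helper are illustrative names only and are not part of Zend Framework; the flags simply restate the descriptions above, with "tokenized" meaning the value is broken up into tokens for indexing.

```php
<?php
// How each Zend_Search_Lucene field type treats its value.
// This array and describeFieldType() are illustrative only; they are
// not part of the Zend Framework API.
$fieldTypes = array(
    'Keyword'   => array('stored' => true,  'indexed' => true,  'tokenized' => false),
    'UnIndexed' => array('stored' => true,  'indexed' => false, 'tokenized' => false),
    'Binary'    => array('stored' => true,  'indexed' => false, 'tokenized' => false),
    'Text'      => array('stored' => true,  'indexed' => true,  'tokenized' => true),
    'UnStored'  => array('stored' => false, 'indexed' => true,  'tokenized' => true),
);

// Returns a one-line, human-readable summary for a field type.
function describeFieldType(array $fieldTypes, $type)
{
    $flags = $fieldTypes[$type];

    return sprintf(
        '%s: %s, %s',
        $type,
        $flags['indexed'] ? 'searchable' : 'not searchable',
        $flags['stored'] ? 'returned with results' : 'not returned with results'
    );
}

echo describeFieldType($fieldTypes, 'UnStored'), "\n";
// prints "UnStored: searchable, not returned with results"
```

Read this way, the choice for each piece of article data comes down to two questions: should users be able to search on it, and do we need to read it back when displaying results?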
To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved using the Zend_Search_Lucene_Field class with the field type as the static method name.
In other words, to create the title field data, we use:
$data = Zend_Search_Lucene_Field::Text('title', $docTitle);
Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.
So to add all the data with the field types we just worked out, we would use this:
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
$doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
$doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
Finally, we add the document to the index using the addDocument() method:
$index->addDocument($doc);
We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).
Committing / saving the index
Once all documents have been added, the index must be saved by calling commit():
$index->commit();
You can call commit() after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to be careful and call it once after you’ve done what you need to do with the index.
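As a rough illustration of committing once rather than per document, the following sketch wraps the pattern in a small function. addAllAndCommit() is a hypothetical helper, not part of Zend Framework; it assumes $index exposes addDocument() and commit() as Zend_Search_Lucene does, and that $docs is an array of already-prepared Zend_Search_Lucene_Document objects.

```php
<?php
/**
 * Adds every prepared document to the index, then commits a single time.
 *
 * addAllAndCommit() is a hypothetical helper, not part of Zend Framework.
 * $index is assumed to provide addDocument() and commit(), as
 * Zend_Search_Lucene does.
 */
function addAllAndCommit($index, array $docs)
{
    foreach ($docs as $doc) {
        $index->addDocument($doc);
    }

    // One commit for the whole batch, rather than one per document.
    $index->commit();

    // Number of documents that were added.
    return count($docs);
}

// Possible usage, assuming Zend Framework is on the include path:
// require_once('Zend/Search/Lucene.php');
// $index = Zend_Search_Lucene::create('/var/www/phpriot.com/data/docindex');
// addAllAndCommit($index, $docs);
```

Keeping the commit outside the loop means the cost of saving is paid once, however many documents are in the batch.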
If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.
Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.