Creating A Fulltext Search Engine In PHP 5 With The Zend Framework's Zend Search Lucene

Creating Our First Index

The basic process for creating an index is:

  1. Create (or open) the index
  2. Add each document
  3. Commit (save) the index

The index is stored on the filesystem, so when you create an index, you specify the path where it is to be stored. A new index is created by calling the static create() method of the Zend_Search_Lucene class; the static open() method is used to open an index that already exists. The first and only argument to either method is the filesystem path of the index.

Listing 4 listing-4.php
<?php
    require_once('Zend/Search/Lucene.php');
 
    $indexPath = '/var/www/phpriot.com/data/docindex';
 
    $index = Zend_Search_Lucene::create($indexPath);
?>
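
If the same script may run more than once, one common pattern is to try opening the index first and fall back to creating it. The following is only a minimal sketch of that idea: openOrCreateIndex() is a hypothetical helper name, and the sketch assumes that Zend/Search/Lucene.php has already been loaded as in Listing 4 and that open() throws a Zend_Search_Lucene_Exception when no index exists at the given path.

```php
<?php
/**
 * Hypothetical helper: open an existing index, or create one if none exists.
 * Assumes Zend/Search/Lucene.php has already been loaded (see Listing 4).
 */
function openOrCreateIndex($indexPath)
{
    try {
        // open() expects an index to already exist at this path
        return Zend_Search_Lucene::open($indexPath);
    } catch (Zend_Search_Lucene_Exception $e) {
        // Assumption: a missing index is reported via this exception type
        return Zend_Search_Lucene::create($indexPath);
    }
}
```

With this in place, the last line of Listing 4 could instead read $index = openOrCreateIndex($indexPath);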

Adding a document to our index

Now that we have opened our index, we must add our documents to it. We create a document to be added to the index using the following:

Listing 5 listing-5.php
<?php
    $doc = new Zend_Search_Lucene_Document();
?>

The next thing we must do is determine which fields we need to add to our index.

There are several different field types that can be stored for a single document. As we mentioned earlier, we want to store the keywords for our document, and also store some extra data that we can use when fetching the data. If we wanted to, we could duplicate the entire document content in our search index; however, this would be overkill and is unnecessary. Duplicating smaller items such as the author name or document title is less of a concern, as each is only a small amount of data.

As you can tell from the previous paragraph, we’re starting to determine which fields we want to store in our index:

  • Document content – we want articles to be searchable by their content, but we don’t want to store a copy of the entire article, as we’re only going to show users a preview of the article in the search results
  • Title – we’re definitely going to include the title in our results
  • Article teaser – even though we’re not storing the content, we do want to store a short summary of the document to display in our search results
  • Author – we will list the author of each document in our search results, plus we want to give users the ability to search by author
  • Created – we’ll also store a timestamp of when the article was created

This brings us to the different types of fields we can use for indexing in Zend_Search_Lucene:

  • Keyword – Data that is searchable and stored in the index, but not broken up into tokens for indexing. This is useful for being able to search on non-textual data such as IDs or URLs.
  • UnIndexed – Data that isn’t available for searching, but is stored with our document (article teaser, article URL and timestamp of creation)
  • UnStored – Data that is available for search, but isn’t stored in the index in full (the document content)
  • Text – Data that is available for search and is stored in full (title and author)

There is also the Binary field available, but we won’t be using it in this example. Binary data isn’t searchable, but could be used to store an image with an indexed document.
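
The field types described above can be summarised as a small lookup table. The sketch below does not call Zend_Search_Lucene at all; it simply restates the descriptions as data, and the function names fieldTypeSummary() and fieldTypesFor() are hypothetical, chosen for illustration only.

```php
<?php
/**
 * Restates the field type descriptions as a plain array.
 * 'searchable' = available for searching, 'stored' = kept with the document,
 * 'tokenized'  = broken up into tokens before indexing.
 */
function fieldTypeSummary()
{
    return array(
        'Keyword'   => array('searchable' => true,  'stored' => true,  'tokenized' => false),
        'UnIndexed' => array('searchable' => false, 'stored' => true,  'tokenized' => false),
        'UnStored'  => array('searchable' => true,  'stored' => false, 'tokenized' => true),
        'Text'      => array('searchable' => true,  'stored' => true,  'tokenized' => true),
        'Binary'    => array('searchable' => false, 'stored' => true,  'tokenized' => false),
    );
}

/**
 * Hypothetical helper: list the field types that match the given requirements.
 */
function fieldTypesFor($searchable, $stored)
{
    $matches = array();
    foreach (fieldTypeSummary() as $type => $flags) {
        if ($flags['searchable'] === $searchable && $flags['stored'] === $stored) {
            $matches[] = $type;
        }
    }
    return $matches;
}
```

For example, fieldTypesFor(true, false) returns array('UnStored'), which is why that type suits the document content: it must be searchable, but we don’t want a full copy of it in the index.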

To add a field to our indexed document, we use the addField() method. When we call this method, we must pass data that has been prepared according to the necessary field type. This is achieved using the Zend_Search_Lucene_Field class, with the field type as the name of the static method.

In other words, to create the title field data, we use:

Listing 6 listing-6.php
<?php
    $data = Zend_Search_Lucene_Field::Text('title', $docTitle);
?>

Note that we specify the name of the field here. We can later reference this data by this field name, but we will cover this in the section about querying the index.

So to add all the data with the field types we just worked out, we would use this:

Listing 7 listing-7.php
<?php
    $doc = new Zend_Search_Lucene_Document();
 
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $docUrl));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', $docCreated));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $docTeaser));
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $docTitle));
    $doc->addField(Zend_Search_Lucene_Field::Text('author', $docAuthor));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docBody));
?>
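
Since the same six fields are added for every article, the code in Listing 7 could be wrapped in a function. The following is a rough sketch rather than the code used on phpRiot: buildArticleDocument() and the array keys it reads are hypothetical names, and it assumes the Zend_Search_Lucene classes have already been loaded as in Listing 4.

```php
<?php
/**
 * Hypothetical helper: build an index document from an article array.
 * Assumes $article has the keys used below (names chosen for illustration).
 */
function buildArticleDocument(array $article)
{
    $doc = new Zend_Search_Lucene_Document();

    // Stored for display in the search results, but not searchable
    foreach (array('url', 'created', 'teaser') as $name) {
        $doc->addField(Zend_Search_Lucene_Field::UnIndexed($name, $article[$name]));
    }

    // Searchable and stored in full
    foreach (array('title', 'author') as $name) {
        $doc->addField(Zend_Search_Lucene_Field::Text($name, $article[$name]));
    }

    // Searchable, but the full body is not stored in the index
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $article['body']));

    return $doc;
}
```

Grouping the fields by type keeps the decision made earlier (which data is searchable and which is stored) visible in one place.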

Finally, we add the document to the index using addDocument():

Listing 8 listing-8.php
<?php
    $index->addDocument($doc);
?>

We will cover this code in full in the next chapter (including looping over several documents and adding them all at once).

Committing / saving the index

Once all documents have been added, the index must be saved.

Listing 9 listing-9.php
<?php
    $index->commit();
?>

You can call commit() after adding each document, but this operation is expensive and each call results in a new index segment (in layman’s terms, this means it’s an extra file to search when querying the index). Note that better management of the segments in the index is a planned future improvement, but for the time being it’s best to call commit() once, after you’ve finished making changes to the index.
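
To make the "commit once" advice concrete, here is a minimal sketch of adding a batch of documents and committing a single time at the end. The function name addDocumentsAndCommit() is hypothetical; $index is assumed to be the object created in Listing 4, and $documents an array of documents built as in Listing 7.

```php
<?php
/**
 * Hypothetical helper: add every document, then commit a single time.
 * Returns the number of documents that were added.
 */
function addDocumentsAndCommit($index, array $documents)
{
    foreach ($documents as $doc) {
        $index->addDocument($doc);   // no commit inside the loop
    }

    $index->commit();                // one commit for the whole batch

    return count($documents);
}
```

Calling commit() inside the loop instead would still work, but according to the explanation above it would produce one segment per document.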

If you were to look at your filesystem now that you have committed the index, you would notice a directory at the location you specified when opening the index. This directory contains a number of files which are then used when querying the index.

Next up we will piece all the above code into a single script which indexes all the articles on phpRiot.

Article History

Apr 27, 2006
Initial article version
Dec 17, 2007
Updated to use Zend Framework 1.0.3