Text Analysis

The Zend_Search_Lucene_Analysis_Analyzer class is used by the indexer to tokenize document text fields.

The Zend_Search_Lucene_Analysis_Analyzer::getDefault() and Zend_Search_Lucene_Analysis_Analyzer::setDefault() methods are used to get and set the default analyzer.

You can supply your own text analyzer or pick one of the predefined analyzers: Zend_Search_Lucene_Analysis_Analyzer_Common_Text and Zend_Search_Lucene_Analysis_Analyzer_Common_Text_CaseInsensitive (the default). Both treat a token as a sequence of letters. Zend_Search_Lucene_Analysis_Analyzer_Common_Text_CaseInsensitive additionally converts all tokens to lower case.
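The difference between the two predefined analyzers can be sketched in plain PHP, without loading Zend Framework. This is only an approximation under the assumption that "letters" means ASCII letters; the function name sketchTokenize() and the sample text are hypothetical and not part of the library.

```php
<?php
// Hypothetical helper, not part of Zend Framework: approximates how the
// two predefined analyzers split text into letter-only tokens.
function sketchTokenize(string $text, bool $lowercase): array
{
    // Treat every maximal run of ASCII letters as one token
    // (assumption: digits and punctuation act as separators).
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    $tokens = $matches[0];

    // The case-insensitive variant additionally lower-cases each token.
    return $lowercase ? array_map('strtolower', $tokens) : $tokens;
}

$text = 'Zend Framework 1 supports PHP5';

$caseSensitive   = sketchTokenize($text, false);
$caseInsensitive = sketchTokenize($text, true);

print_r($caseSensitive);
print_r($caseInsensitive);
```

Note that the digits in "1" and "PHP5" are dropped in both cases, which is the motivation for the custom analyzer shown later in this section.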

To switch between analyzers:

<?php
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Text());
...
$index->addDocument($doc);

The Zend_Search_Lucene_Analysis_Analyzer_Common class is designed to be the ancestor of all user-defined analyzers. A user-defined analyzer only needs to define the reset() and nextToken() methods. nextToken() takes its string from the $_input member and returns tokens one by one (a NULL value indicates the end of the stream).

The nextToken() method should call the normalize() method on each token. This will allow you to use token filters with your analyzer.
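Why the normalize() call matters can be illustrated with a small stand-alone sketch. It assumes that normalize() passes a token through each registered filter in turn and that a filter may return NULL to drop the token; plain closures and strings stand in for the library's filter and token classes, and the names sketchNormalize(), $filters, and the stop-word list are hypothetical.

```php
<?php
// Hypothetical stand-in for an analyzer's filter chain: each filter is a
// closure that returns the (possibly modified) term, or null to drop it.
function sketchNormalize(string $term, array $filters): ?string
{
    foreach ($filters as $filter) {
        $term = $filter($term);
        if ($term === null) {
            return null; // a filter removed the token
        }
    }
    return $term;
}

$filters = [
    // Lower-case every term.
    function (string $term): ?string {
        return strtolower($term);
    },
    // Drop a few stop words (illustrative list only).
    function (string $term): ?string {
        return in_array($term, ['a', 'the', 'of'], true) ? null : $term;
    },
];

$kept = [];
foreach (['The', 'Index', 'of', 'Documents'] as $term) {
    $normalized = sketchNormalize($term, $filters);
    if ($normalized !== null) {
        $kept[] = $normalized;
    }
}

print_r($kept);
```

Because a filter can remove a token entirely, a nextToken() implementation has to keep reading until it finds a token that survives normalization or reaches the end of the input.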

Here is an example of a custom analyzer, which accepts words with digits as terms:

Example 704. Custom text Analyzer

<?php
/**
 * Here is a custom text analyzer, which treats words with digits as
 * one term
 */

class My_Analyzer extends Zend_Search_Lucene_Analysis_Analyzer_Common
{
    private $_position;

    /**
     * Reset token stream
     */
    public function reset()
    {
        $this->_position = 0;
    }

    /**
     * Tokenization stream API
     * Get next token
     * Returns null at the end of stream
     *
     * @return Zend_Search_Lucene_Analysis_Token|null
     */
    public function nextToken()
    {
        if ($this->_input === null) {
            return null;
        }

        while ($this->_position < strlen($this->_input)) {
            // skip white space (and any other non-alphanumeric character)
            while ($this->_position < strlen($this->_input) &&
                   !ctype_alnum($this->_input[$this->_position])) {
                $this->_position++;
            }

            $termStartPosition = $this->_position;

            // read token
            while ($this->_position < strlen($this->_input) &&
                   ctype_alnum($this->_input[$this->_position])) {
                $this->_position++;
            }

            // Empty token, end of stream.
            if ($this->_position == $termStartPosition) {
                return null;
            }

            $token = new Zend_Search_Lucene_Analysis_Token(
                                      substr($this->_input,
                                             $termStartPosition,
                                             $this->_position -
                                             $termStartPosition),
                                      $termStartPosition,
                                      $this->_position);
            $token = $this->normalize($token);
            if ($token !== null) {
                return $token;
            }
            // Continue if token is skipped
        }

        return null;
    }
}

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new My_Analyzer());
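To check the scanning logic of the example without a Zend Framework installation, the same loop can be written as a plain function. This is a sketch only: sketchAlnumTokens() and the sample input are hypothetical, normalization is left out, and each token is represented as a simple array of term text, start offset, and end offset instead of a Zend_Search_Lucene_Analysis_Token object.

```php
<?php
// Hypothetical helper mirroring the loop in My_Analyzer::nextToken(),
// collecting every token at once instead of returning them one by one.
function sketchAlnumTokens(string $input): array
{
    $tokens   = [];
    $position = 0;
    $length   = strlen($input);

    while ($position < $length) {
        // Skip anything that is not a letter or a digit.
        while ($position < $length && !ctype_alnum($input[$position])) {
            $position++;
        }

        $start = $position;

        // Consume a run of letters and digits.
        while ($position < $length && ctype_alnum($input[$position])) {
            $position++;
        }

        // Nothing consumed: only separators were left.
        if ($position === $start) {
            break;
        }

        // [term text, start offset, end offset]
        $tokens[] = [substr($input, $start, $position - $start), $start, $position];
    }

    return $tokens;
}

$tokens = sketchAlnumTokens('PHP5 and Zend_1');
print_r($tokens);
```

With this input the underscore acts as a separator, so "Zend" and "1" come out as separate terms, while "PHP5" is kept as a single term because letters and digits are scanned together.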

