The Zend_Search_Lucene_Analysis_Analyzer_Common analyzer also
offers a token filtering mechanism.
The Zend_Search_Lucene_Analysis_TokenFilter class provides an
abstract interface for such filters. Your own filters should extend this class either
directly or indirectly.
Any custom filter must implement the normalize() method, which
may transform the input token or signal that the current token should be skipped.
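To illustrate the normalize() contract described above, here is a minimal sketch of a custom filter. The class name My_TokenFilter_Prefix and its $maxLength parameter are hypothetical and exist only for this example. The sketch assumes that Zend Framework 1 is on the include path and that returning null from normalize() is how a filter signals that a token should be skipped; check both against your version of the library.

```php
<?php
// Assumption: Zend Framework 1 is available on the include path.
require_once 'Zend/Search/Lucene/Analysis/Token.php';
require_once 'Zend/Search/Lucene/Analysis/TokenFilter.php';

/**
 * Hypothetical example filter (not part of the library):
 * skips purely numeric tokens and truncates long tokens to a fixed prefix.
 */
class My_TokenFilter_Prefix extends Zend_Search_Lucene_Analysis_TokenFilter
{
    /** @var int Maximum number of characters kept per token */
    private $_maxLength;

    public function __construct($maxLength = 8)
    {
        $this->_maxLength = $maxLength;
    }

    public function normalize(Zend_Search_Lucene_Analysis_Token $srcToken)
    {
        $text = $srcToken->getTermText();

        // Skip case. Assumption: null tells the analyzer to drop the token.
        if (ctype_digit($text)) {
            return null;
        }

        // Pass-through case: the token is already short enough.
        if (strlen($text) <= $this->_maxLength) {
            return $srcToken;
        }

        // Transform case: build a new token with truncated text,
        // keeping the original offsets and position increment.
        $newToken = new Zend_Search_Lucene_Analysis_Token(
            substr($text, 0, $this->_maxLength),
            $srcToken->getStartOffset(),
            $srcToken->getEndOffset()
        );
        $newToken->setPositionIncrement($srcToken->getPositionIncrement());

        return $newToken;
    }
}
```

A filter like this would be attached with $analyzer->addFilter(new My_TokenFilter_Prefix(8)), in the same way as the predefined filters. Because queries pass through the same analyzer as indexed text, the truncation stays consistent between indexing and searching.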
There are three filters already defined in the analysis subpackage:
Zend_Search_Lucene_Analysis_TokenFilter_LowerCase
Zend_Search_Lucene_Analysis_TokenFilter_ShortWords
Zend_Search_Lucene_Analysis_TokenFilter_StopWords
The LowerCase filter is already used by the
Zend_Search_Lucene_Analysis_Analyzer_Common_Text_CaseInsensitive
analyzer by default.
The ShortWords and StopWords filters may be used with
pre-defined or custom analyzers like this:
<?php
$stopWords = array('a', 'an', 'at', 'the', 'and', 'or', 'is', 'am');
$stopWordsFilter =
    new Zend_Search_Lucene_Analysis_TokenFilter_StopWords($stopWords);

$analyzer =
    new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();
$analyzer->addFilter($stopWordsFilter);

Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);

<?php
$shortWordsFilter = new Zend_Search_Lucene_Analysis_TokenFilter_ShortWords();

$analyzer =
    new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();
$analyzer->addFilter($shortWordsFilter);

Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
The Zend_Search_Lucene_Analysis_TokenFilter_StopWords constructor
takes an array of stop words as input. Stop words may also be loaded from a file:
<?php
// $my_stopwords_file must hold the path to the stop-words file.
$stopWordsFilter = new Zend_Search_Lucene_Analysis_TokenFilter_StopWords();
$stopWordsFilter->loadFromFile($my_stopwords_file);

$analyzer =
    new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();
$analyzer->addFilter($stopWordsFilter);

Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
This file should be a plain text file with one word per line. A line that starts with the '#' character is treated as a comment.
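The file format described above can be sketched as follows. The file contents and the use of a temporary file are illustrative assumptions, and the example assumes that Zend Framework 1 is on the include path.

```php
<?php
// Assumption: Zend Framework 1 is available on the include path.
require_once 'Zend/Search/Lucene/Analysis/TokenFilter/StopWords.php';

// Hypothetical file contents: one word per line, '#' starts a comment line.
$contents = "# articles\n"
          . "a\n"
          . "an\n"
          . "the\n"
          . "# conjunctions\n"
          . "and\n"
          . "or\n";

// Write the list to a temporary file purely for demonstration.
$my_stopwords_file = tempnam(sys_get_temp_dir(), 'stopwords');
file_put_contents($my_stopwords_file, $contents);

// Load the stop words from the file instead of passing an array.
$stopWordsFilter = new Zend_Search_Lucene_Analysis_TokenFilter_StopWords();
$stopWordsFilter->loadFromFile($my_stopwords_file);

// The filter keeps the words in memory, so the file can be removed now.
unlink($my_stopwords_file);
```

In a real application the list would normally live in a file kept alongside the application's configuration rather than in a temporary file.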
The Zend_Search_Lucene_Analysis_TokenFilter_ShortWords
constructor takes one optional argument: the word length limit, which defaults to
2.
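A short sketch of passing the length limit explicitly, assuming that Zend Framework 1 is on the include path and that normalize() returns null for skipped tokens. The sample words are arbitrary. The four-letter word sits exactly on the limit, so the output shows how your version treats the boundary.

```php
<?php
// Assumption: Zend Framework 1 is available on the include path.
require_once 'Zend/Search/Lucene/Analysis/Token.php';
require_once 'Zend/Search/Lucene/Analysis/TokenFilter/ShortWords.php';

// Use a limit of 4 instead of the default of 2.
$shortWordsFilter = new Zend_Search_Lucene_Analysis_TokenFilter_ShortWords(4);

// Feed a few sample tokens through the filter and report what happens.
foreach (array('an', 'cat', 'word', 'search') as $text) {
    $token  = new Zend_Search_Lucene_Analysis_Token($text, 0, strlen($text));
    $result = $shortWordsFilter->normalize($token);

    // Assumption: null means the token was skipped.
    echo $text, ': ', ($result === null ? 'skipped' : 'kept'), "\n";
}
```

The filter is then added to an analyzer with addFilter(), exactly as in the examples with the default limit.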




