Zend_Search_Lucene offers an HTML parsing
feature. Documents can be created directly from an HTML file or
string:
<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
$index->addDocument($doc);
...
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$index->addDocument($doc);
The Zend_Search_Lucene_Document_Html class uses the
DOMDocument::loadHTML() and
DOMDocument::loadHTMLFile() methods to parse the source
HTML, so the HTML does not need to be well formed or
to be XHTML. On the other hand, parsing is sensitive to the encoding
specified by the "meta http-equiv" header tag.
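The following minimal sketch illustrates both points using only PHP's built-in DOM extension, not Zend_Search_Lucene itself. The sample markup and variable names are made up for illustration. The paragraphs are left unclosed, yet the parser still builds a document tree, and the charset declared in the "meta http-equiv" tag tells the parser how to decode the title bytes.

```php
<?php
// Deliberately malformed HTML: no <html>/<body> wrapper, unclosed <p> tags.
// "Caf\xC3\xA9" is the UTF-8 byte sequence for "Cafe" with an acute accent.
$html = '<head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . "<title>Caf\xC3\xA9</title>"
      . '</head>'
      . '<p>First paragraph<p>Second paragraph';

// Collect recoverable parse errors internally instead of emitting warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// DOM always hands strings back as UTF-8.
$title      = $dom->getElementsByTagName('title')->item(0)->nodeValue;
$paragraphs = $dom->getElementsByTagName('p')->length;
```

If the meta tag is dropped from the sample, the parser has to guess the encoding, which is why a declared (or explicitly supplied) encoding matters for non-ASCII content.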
The Zend_Search_Lucene_Document_Html class recognizes the document title,
body and document header meta tags.
The 'title' field is actually the /html/head/title value. It's stored within the index, tokenized and available for search.
The 'body' field is the actual body content of the HTML file or string. It doesn't include scripts, comments or attributes.
The loadHTML() and loadHTMLFile()
methods of the Zend_Search_Lucene_Document_Html class also accept an
optional second argument. If it is set to TRUE, the body content is
also stored within the index and can be retrieved from it. By default, the body is
tokenized and indexed, but not stored.
The optional third parameter of the loadHTML() and
loadHTMLFile() methods specifies the encoding of the source
HTML document. It is used only if the encoding is not specified by a
Content-type HTTP-EQUIV meta tag.
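As a hedged usage sketch, the optional arguments can be passed as follows. It assumes that Zend Framework 1.x is on the include path and that the class file follows the usual naming convention. The sample markup and the $bodyIsStored variable are illustrative only.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include path.
require_once 'Zend/Search/Lucene/Document/Html.php';

// This sample declares no encoding, so the third argument applies.
$htmlString = '<html><head><title>Sample</title></head>'
            . '<body><p>Hello world</p></body></html>';

// Default call: the body is tokenized and indexed, but not stored.
$unstoredDoc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);

// Second argument TRUE: the body text is stored as well.
// Third argument: fallback encoding for HTML that declares none.
$storedDoc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString, true, 'utf-8');

// Each field exposes its storage flag as a public property.
$bodyIsStored = $storedDoc->getField('body')->isStored;
```

Storing the body enlarges the index, so it is usually worth enabling only when search results need to display the text itself.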
Other document header meta tags produce additional document fields. The field name is taken from the 'name' attribute, and the field value from the 'content' attribute. These fields are tokenized, indexed and stored, so documents may be searched by their meta tags (for example, by keywords).
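The name/content mapping described above can be pictured with plain DOM calls. This is only an illustration of the mapping, not the library's actual implementation. The XPath expression, the sample markup and the $metaFields variable are assumptions made for the example.

```php
<?php
$html = '<html><head><title>Sample</title>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<meta name="keywords" content="search, lucene">'
      . '<meta name="description" content="A sample page">'
      . '</head><body><p>Hello</p></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Collect name => content pairs from header meta tags carrying both attributes.
// The http-equiv tag has no 'name' attribute, so it is skipped.
$metaFields = array();
$xpath = new DOMXPath($dom);
foreach ($xpath->query('/html/head/meta[@name and @content]') as $meta) {
    $metaFields[$meta->getAttribute('name')] = $meta->getAttribute('content');
}
```

Each pair collected this way corresponds to one additional field on the parsed document, for example a 'keywords' field holding "search, lucene".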
Parsed documents may be augmented by the programmer with any other field:
<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', time()));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('updated', time()));
$doc->addField(Zend_Search_Lucene_Field::Text('annotation',
                                              'Document annotation text'));
$index->addDocument($doc);
Document links are not included in the generated document, but may be retrieved with
the Zend_Search_Lucene_Document_Html::getLinks() and
Zend_Search_Lucene_Document_Html::getHeaderLinks() methods:
<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$linksArray = $doc->getLinks();
$headerLinksArray = $doc->getHeaderLinks();
Starting from Zend Framework 1.6, it is also possible to exclude links with the
rel attribute set to 'nofollow'. Use
Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true)
to turn on this option.
The Zend_Search_Lucene_Document_Html::getExcludeNoFollowLinks()
method returns the current state of the "Exclude nofollow links" flag.
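A short usage sketch of the flag follows. It assumes Zend Framework 1.x is on the include path, and the sample URLs are placeholders. The flag is set before parsing here on the assumption that links are collected while the document is loaded, so changing it afterwards would not affect an already parsed document.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include path.
require_once 'Zend/Search/Lucene/Document/Html.php';

$htmlString = '<html><head><title>Links</title></head><body>'
            . '<a href="http://example.com/a">Followed</a> '
            . '<a rel="nofollow" href="http://example.com/b">Not followed</a>'
            . '</body></html>';

// Turn the option on before parsing the document.
Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true);
$flag = Zend_Search_Lucene_Document_Html::getExcludeNoFollowLinks();

$doc   = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$links = $doc->getLinks();
```

The setting is static, so it applies to every document parsed afterwards until it is switched off again.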




