PhpRiot

HTML documents

Zend_Search_Lucene offers an HTML parsing feature. Documents can be created directly from an HTML file or string:

<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
$index->addDocument($doc);
// ...
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$index->addDocument($doc);

The Zend_Search_Lucene_Document_Html class uses the DOMDocument::loadHTML() and DOMDocument::loadHTMLFile() methods to parse the source HTML, so the HTML does not need to be well formed or to be XHTML. It is, however, sensitive to the encoding specified by the "meta http-equiv" header tag.

The Zend_Search_Lucene_Document_Html class recognizes the document title, the body and the document header meta tags.

The 'title' field is actually the /html/head/title value. It's stored within the index, tokenized and available for search.

The 'body' field is the actual body content of the HTML file or string. It doesn't include scripts, comments or attributes.
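The following is a minimal sketch of the idea behind the 'title' and 'body' fields, using only PHP's DOM extension. It is not the class's actual implementation: the sample markup, the variable names and the XPath expression for the body text are assumptions made for illustration.

```php
<?php
// Deliberately sloppy markup: unclosed <p> tags and no doctype.
$html = '<html><head><title>Sample page</title></head>'
      . '<body><p class="intro">Hello<!-- hidden note -->'
      . '<script>var tracker = 1;</script><p>World</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// The title is the string value of /html/head/title.
$title = $xpath->evaluate('string(/html/head/title)');

// Approximate the body text: text nodes only, skipping anything inside <script>.
// Comments and attribute values are not text nodes, so they drop out on their own.
$parts = array();
foreach ($xpath->query('/html/body//text()[not(ancestor::script)]') as $node) {
    $parts[] = trim($node->nodeValue);
}
$body = implode(' ', array_filter($parts, 'strlen'));
```

Here $title and $body hold the text that would roughly correspond to the two fields; the script source, the comment and the class attribute do not appear in $body.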

The loadHTML() and loadHTMLFile() methods of the Zend_Search_Lucene_Document_Html class also accept an optional second argument. If it is set to TRUE, the body content is also stored within the index and can be retrieved from it. By default, the body is tokenized and indexed, but not stored.

The optional third parameter of the loadHTML() and loadHTMLFile() methods specifies the encoding of the source HTML document. It is used only if the encoding is not specified by a Content-type HTTP-EQUIV meta tag.
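To illustrate why a default encoding matters, here is a sketch of one way such a fallback could work with plain DOMDocument: when the markup carries no charset declaration, one is prepended before parsing. The helper name loadWithDefaultEncoding() is hypothetical, and this should not be read as a description of what Zend_Search_Lucene_Document_Html does internally.

```php
<?php
// Hypothetical helper, not part of Zend Framework.
function loadWithDefaultEncoding($html, $defaultEncoding)
{
    // Only add a declaration when the markup does not already carry one.
    if (!preg_match('/<meta[^>]+charset/i', $html)) {
        $html = '<meta http-equiv="Content-Type" content="text/html; charset='
              . $defaultEncoding . '">' . $html;
    }

    libxml_use_internal_errors(true);   // tolerate sloppy markup quietly
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    return $dom;
}

// UTF-8 input without any encoding declaration of its own.
$source = "<html><head><title>Caf\xC3\xA9</title></head><body></body></html>";

$dom   = loadWithDefaultEncoding($source, 'UTF-8');
$title = $dom->getElementsByTagName('title')->item(0)->textContent;
```

With the declaration in place, the parser reads the bytes as UTF-8 and the title comes back intact instead of being decoded with the parser's default encoding.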

Other document header meta tags produce additional document fields. The field name is taken from the 'name' attribute, and the 'content' attribute supplies the field value. Both are tokenized, indexed and stored, so documents may be searched by their meta tags (for example, by keywords).
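The mapping from meta tags to fields can be sketched with the DOM extension alone. The sample markup and the $fields array are assumptions for illustration; in the real class each pair becomes a document field rather than an array entry.

```php
<?php
$html = '<html><head><title>Sample</title>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<meta name="keywords" content="search, lucene, php">'
      . '<meta name="description" content="A short sample page">'
      . '</head><body></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$fields = array();
foreach ($dom->getElementsByTagName('meta') as $meta) {
    // Only meta tags carrying both "name" and "content" become fields in this
    // sketch; the http-equiv tag has no "name" attribute and is skipped.
    if ($meta->hasAttribute('name') && $meta->hasAttribute('content')) {
        $fields[$meta->getAttribute('name')] = $meta->getAttribute('content');
    }
}
```

On a parsed Zend_Search_Lucene_Document_Html instance, the corresponding value should be readable with the document's getFieldValue() method, for example $doc->getFieldValue('keywords').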

Parsed documents may be augmented by the programmer with any other field:

<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
                                                   time()));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('updated',
                                                   time()));
$doc->addField(Zend_Search_Lucene_Field::Text('annotation',
                                              'Document annotation text'));
$index->addDocument($doc);

Document links are not included in the generated document, but may be retrieved with the Zend_Search_Lucene_Document_Html::getLinks() and Zend_Search_Lucene_Document_Html::getHeaderLinks() methods:

<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$linksArray = $doc->getLinks();
$headerLinksArray = $doc->getHeaderLinks();

Starting from Zend Framework 1.6, it is also possible to exclude links whose rel attribute is set to 'nofollow'. Use Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true) to turn this option on.

The Zend_Search_Lucene_Document_Html::getExcludeNoFollowLinks() method returns the current state of the "Exclude nofollow links" flag.
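The effect of the flag can be sketched with plain DOM code. The collectLinks() helper below is hypothetical and only approximates the behaviour described above; it is not the library's own code.

```php
<?php
// Hypothetical helper, not part of Zend Framework.
function collectLinks(DOMDocument $dom, $excludeNoFollow)
{
    $links = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href === '') {
            continue;   // anchors without a target are not links
        }
        if ($excludeNoFollow && strtolower($anchor->getAttribute('rel')) === 'nofollow') {
            continue;   // skip links marked rel="nofollow" when the flag is on
        }
        $links[] = $href;
    }
    return $links;
}

$html = '<html><head><title>Links</title></head><body>'
      . '<a href="/docs">Docs</a>'
      . '<a href="/ads" rel="nofollow">Sponsored</a>'
      . '<a name="top">No href</a>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$allLinks      = collectLinks($dom, false);   // flag off: every href is kept
$followedLinks = collectLinks($dom, true);    // flag on: nofollow links are dropped
```

With the flag off, both /docs and /ads are collected; with it on, only /docs remains.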
