Indexing performance is a compromise between resource usage, indexing time and index quality.
Index quality is completely determined by the number of index segments.
Each index segment is an entirely independent portion of data, so an index containing more segments needs more memory and time for searching.
Index optimization is a process of merging several segments into a new one. A fully optimized index contains only one segment.
Full index optimization may be performed on an opened index object:
$index = Zend_Search_Lucene::open($indexPath);
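Completing the fragment above: a minimal sketch, assuming Zend Framework 1 is on the include path and that the `optimize()` method of the index object is the call that merges all segments into one. The `/path/to/index` location is a placeholder, not a real path.

```php
<?php
// Assumption: Zend Framework 1 is available on the include path.
require_once 'Zend/Search/Lucene.php';

// Hypothetical location of an existing index.
$indexPath = '/path/to/index';

// Open the existing index.
$index = Zend_Search_Lucene::open($indexPath);

// Merge all segments into a single one. This streams data and needs
// little memory, but it costs processor time.
$index->optimize();
```

Because the call is CPU-bound and can take a while on a large index, it is usually better suited to a maintenance script than to a request that a user is waiting on.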
Index optimization works with data streams and doesn't take a lot of memory but does require processor resources and time.
Lucene index segments are not updatable by their nature (the update operation requires the segment file to be completely rewritten). So adding new document(s) to an index always generates a new segment. This, in turn, decreases index quality.
An index auto-optimization process is performed after each segment generation and consists of merging partial segments.
There are three options to control the behavior of auto-optimization (see the Index optimization section):
MaxBufferedDocs is the number of documents that can be buffered in memory before a new segment is generated and written to the hard drive.
MaxMergeDocs is the maximum number of documents merged by the auto-optimization process into a new segment.
MergeFactor determines how often auto-optimization is performed.
All these options are properties of the Zend_Search_Lucene object, not properties of the index. They affect only the behavior of the current Zend_Search_Lucene object and may vary from one object to another.
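As an illustration, the three options can be set through the setter methods of the index object. This is a sketch only: the path is a placeholder and the numeric values are arbitrary examples, not recommendations.

```php
<?php
// Assumption: Zend Framework 1 is available on the include path.
require_once 'Zend/Search/Lucene.php';

// Hypothetical location of an existing index.
$index = Zend_Search_Lucene::open('/path/to/index');

// Illustrative values only; tune them for your own workload.
$index->setMaxBufferedDocs(100);   // documents buffered before a segment is written
$index->setMaxMergeDocs(10000);    // largest segment auto-optimization may produce
$index->setMergeFactor(10);        // how often auto-optimization merges segments

// The values live on this object, so they can be read back from it.
printf(
    "buffered=%d merge=%d factor=%d\n",
    $index->getMaxBufferedDocs(),
    $index->getMaxMergeDocs(),
    $index->getMergeFactor()
);
```

Since the values are not stored in the index, every script that opens the index has to set them again if it needs non-default behavior.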
MaxBufferedDocs doesn't have any effect if you index only one document per script execution. On the other hand, it's very important for batch indexing. Greater values increase indexing performance, but also require more memory.
There is simply no way to calculate the best value for the MaxBufferedDocs parameter because it depends on average document size, the analyzer in use and allowed memory.
A good way to find the right value is to perform several tests with the largest document you expect to be added to the index. It's a best practice not to use more than half of the allowed memory.
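The rule of thumb above can be turned into a rough starting value. The helper below is hypothetical (it is not part of Zend_Search_Lucene) and assumes that memory use grows roughly linearly with the number of buffered documents. The per-document cost would have to be measured first, for example by comparing `memory_get_peak_usage()` before and after indexing the largest expected document.

```php
<?php
/**
 * Hypothetical helper: derive a starting value for MaxBufferedDocs from the
 * memory limit and the measured memory cost of the largest document.
 *
 * @param int   $memoryLimitBytes   memory available to the script
 * @param int   $bytesPerLargestDoc measured cost of buffering one large document
 * @param float $budget             fraction of memory to use (0.5 = half)
 */
function estimateMaxBufferedDocs(
    int $memoryLimitBytes,
    int $bytesPerLargestDoc,
    float $budget = 0.5
): int {
    if ($bytesPerLargestDoc <= 0) {
        throw new InvalidArgumentException('bytesPerLargestDoc must be positive');
    }

    // Memory we allow the buffer to occupy.
    $usable = (int) floor($memoryLimitBytes * $budget);

    // Always buffer at least one document.
    return max(1, intdiv($usable, $bytesPerLargestDoc));
}

// Example: 128 MiB limit, 2 MiB measured for the largest document.
echo estimateMaxBufferedDocs(128 * 1024 * 1024, 2 * 1024 * 1024), "\n"; // 32
```

Treat the result as a first guess to verify with real test runs, since analyzer behavior and document structure can make the actual cost non-linear.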
MaxMergeDocs limits the segment size (in terms of documents). It
therefore also limits auto-optimization time by bounding the amount of
merge work that a single addDocument() call can trigger.
This is very important for interactive applications.
Lowering the MaxMergeDocs parameter may also improve batch indexing performance. Index auto-optimization is an iterative process and is performed from the bottom up. Small segments are merged into larger segments, which are in turn merged into even larger segments, and so on. Full index optimization is achieved when only one large segment file remains.
Small segments generally decrease index quality. Many small segments may also trigger the "Too many open files" error, which is caused by operating system limits.
In general, background index optimization should be performed for interactive indexing mode, and MaxMergeDocs shouldn't be too low for batch indexing.
MergeFactor affects auto-optimization frequency. Lower values increase the quality of unoptimized indexes. Larger values increase indexing performance, but also increase the number of merged segments. This again may trigger the "Too many open files" error.
MergeFactor groups index segments by their size:
Not greater than MaxBufferedDocs.
Greater than MaxBufferedDocs, but not greater than MaxBufferedDocs*MergeFactor.
Greater than MaxBufferedDocs*MergeFactor, but not greater than MaxBufferedDocs*MergeFactor*MergeFactor.
Zend_Search_Lucene checks during each
addDocument() call to see if merging any segments may move the
newly created segment into the next group. If yes, then merging is performed.
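The size groups described above can be expressed as a small function. This is an illustrative sketch, not the library's internal code: the function name and the zero-based group numbering are assumptions made for the example.

```php
<?php
/**
 * Hypothetical helper: return the zero-based size group of a segment.
 * Group 0 holds segments with at most MaxBufferedDocs documents, group 1
 * those with at most MaxBufferedDocs*MergeFactor documents, and so on.
 */
function segmentGroup(int $docCount, int $maxBufferedDocs, int $mergeFactor): int
{
    if ($maxBufferedDocs < 1 || $mergeFactor < 2) {
        throw new InvalidArgumentException('need maxBufferedDocs >= 1 and mergeFactor >= 2');
    }

    $group = 0;
    $upperBound = $maxBufferedDocs;

    // Raise the bound by one factor of MergeFactor until the segment fits.
    while ($docCount > $upperBound) {
        $upperBound *= $mergeFactor;
        $group++;
    }

    return $group;
}

// With MaxBufferedDocs = 10 and MergeFactor = 10:
echo segmentGroup(10, 10, 10), "\n";   // 0
echo segmentGroup(11, 10, 10), "\n";   // 1
echo segmentGroup(100, 10, 10), "\n";  // 1
echo segmentGroup(101, 10, 10), "\n";  // 2
```

For instance, ten segments of 10 documents each sit in group 0 individually, but merged they hold 100 documents and land in group 1, which is the kind of move that triggers a merge.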
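So an index with N groups may contain MaxBufferedDocs + (N-1)*MergeFactor segments and contains at least MaxBufferedDocs*MergeFactor^(N-1) documents.
Because the number of documents is at least MaxBufferedDocs*MergeFactor^(N-1), N-1 is at most the logarithm to base MergeFactor of NumberOfDocuments/MaxBufferedDocs. This gives a good approximation for the number of segments in the index (log_MergeFactor denotes the logarithm to base MergeFactor):
NumberOfSegments <= MaxBufferedDocs + MergeFactor * log_MergeFactor(NumberOfDocuments / MaxBufferedDocs)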
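The estimate above can be evaluated numerically to compare candidate settings. The function below is a hypothetical helper, not part of the library. It also clamps the ratio to 1 so that very small indexes do not yield a negative logarithm, which is an assumption added for the example.

```php
<?php
/**
 * Hypothetical helper: upper estimate for the number of segments,
 * MaxBufferedDocs + MergeFactor * log_MergeFactor(docs / MaxBufferedDocs).
 */
function estimateMaxSegments(int $numberOfDocuments, int $maxBufferedDocs, int $mergeFactor): float
{
    if ($maxBufferedDocs < 1 || $mergeFactor < 2) {
        throw new InvalidArgumentException('need maxBufferedDocs >= 1 and mergeFactor >= 2');
    }

    // Clamp so the logarithm is never negative for tiny indexes.
    $ratio = max(1.0, $numberOfDocuments / $maxBufferedDocs);

    // log($x, $base) computes the logarithm of $x to the given base.
    return $maxBufferedDocs + $mergeFactor * log($ratio, $mergeFactor);
}

// 1,000,000 documents with MaxBufferedDocs = 10 and MergeFactor = 10.
printf("%.1f\n", estimateMaxSegments(1000000, 10, 10)); // 60.0
```

Running the helper with a few MergeFactor values for your expected document count shows quickly whether the resulting segment count stays comfortably below the open-file limit of the system.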
MaxBufferedDocs is determined by allowed memory. With that value fixed, you can choose an appropriate merge factor to get a reasonable number of segments.
Tuning the MergeFactor parameter is more effective for batch indexing performance than MaxMergeDocs, but it's also more coarse-grained. So use the estimation above for tuning MergeFactor, then adjust MaxMergeDocs to get the best batch indexing performance.