Zend_Search_Lucene works with the UTF-8 charset internally. Index
files store Unicode data in Java's "modified UTF-8" encoding. The
Zend_Search_Lucene core fully supports this encoding, with one
exception.
[15]
The actual encoding of the input data can be specified through the
Zend_Search_Lucene API. The data is then automatically converted
to UTF-8.
[15]
Zend_Search_Lucene supports only Basic Multilingual Plane
(BMP) characters (from 0x0000 to 0xFFFF) and doesn't support
"supplementary characters" (characters whose code points are
greater than 0xFFFF).
Java 2 represents these characters as a pair of char (16-bit) values, the first from the high-surrogates range (0xD800-0xDBFF) and the second from the low-surrogates range (0xDC00-0xDFFF). In modified UTF-8, each of the two surrogates is then encoded as an ordinary three-byte UTF-8 sequence, six bytes in total. Standard UTF-8 uses four bytes for a supplementary character.
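
The difference between the two encodings can be checked with a short sketch. It is written in Python purely for illustration and is not part of Zend_Search_Lucene. The helper names (is_bmp_only, to_surrogate_pair, encode_surrogates_separately) are hypothetical, and the sample character U+1D11E is an arbitrary choice of supplementary character.

```python
from typing import Tuple


def is_bmp_only(text: str) -> bool:
    """Return True if every code point lies in the BMP (0x0000-0xFFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits -> 0xD800-0xDBFF
    low = 0xDC00 + (offset & 0x3FF)   # low 10 bits -> 0xDC00-0xDFFF
    return high, low


def encode_surrogates_separately(ch: str) -> bytes:
    """Encode each surrogate as its own three-byte sequence (six bytes)."""
    high, low = to_surrogate_pair(ord(ch))
    # "surrogatepass" lets Python emit the three-byte form of a lone surrogate.
    return b"".join(
        chr(s).encode("utf-8", "surrogatepass") for s in (high, low)
    )


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

standard = clef.encode("utf-8")                # standard UTF-8
six_byte = encode_surrogates_separately(clef)  # surrogate-pair form

print(is_bmp_only("hello"), is_bmp_only(clef))
print(len(standard), standard.hex())
print(len(six_byte), six_byte.hex())
```

A check like is_bmp_only could be run on text before indexing to detect characters that fall outside the supported range. What to do with such characters (reject, strip, or replace) is left open here.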




