Why a project switched from Google Search Appliance to Zend_Lucene

Note: This article was originally published at Planet PHP on 14 January 2011.

Google technology does a good job when searching the wild and treacherous realms of the public internet. However, the commercial Google Search Appliance (GSA) sold for searching intranet websites did not convince me at all. For a client, we first had to integrate the GSA; later we reimplemented search with Zend_Lucene. Here are some thoughts comparing the two search solutions.

This post became rather lengthy. If you just want the summary of my pros and cons for GSA versus Lucene, scroll right to the end :-)

In a project we took over, the customer had already bought a GSA (the "cheap" one - only about $20,000). The client had a list of wishes for how to integrate the appliance into his web sites:

  • Limit access to authorized users
  • Index all content in all languages
  • Filter content by target group (information present as META in the html headers)
  • Show a box with results from their employee directory

GSA Software

The GSA caused problems with most of those requests.

When you activate access protection, the GSA makes a HEAD request for the first 20 or so search results of every single search request, to check whether that user has the right to see each document. As our site has no per-user visibility requirements, we did not need that. But there is no way to deactivate this check, which results in unnecessary load on the web server. We ended up catching the GSA HEAD requests quite early and just sending a Not Modified response without looking into the request any further.
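The early short-circuit described above could look roughly like the following. This is a minimal sketch in Python rather than the project's actual code: it assumes a WSGI stack, and it assumes the appliance can be recognised by the string "gsa-crawler" in its User-Agent header. Both the marker and the class name are illustrative and would have to be checked against real request logs.

```python
# Assumption: requests from the appliance carry this marker in User-Agent.
GSA_AGENT_MARKER = "gsa-crawler"


def is_gsa_head_check(environ):
    """Return True for HEAD requests that appear to come from the appliance."""
    agent = environ.get("HTTP_USER_AGENT", "").lower()
    return environ.get("REQUEST_METHOD") == "HEAD" and GSA_AGENT_MARKER in agent


class GsaHeadShortCircuit:
    """Hypothetical WSGI middleware: answer GSA HEAD checks before the app runs."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_gsa_head_check(environ):
            # Reply immediately; the wrapped application is never invoked.
            start_response("304 Not Modified", [])
            return []
        return self.app(environ, start_response)
```

Wrapping the application once at startup (`application = GsaHeadShortCircuit(application)`) keeps the check in front of any routing, session or database work, which is the point of catching the request early.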

The GSA completely ignores the language declaration (whether in META or in the attribute or inside the html head) and uses its own heuristics. This might be fine for the public Internet, where you have to assume that many sites declare their content to be in the server installation language even if it is not - but in a controlled environment we can make sure those headers are correct. We talked to Google support about this, but they could only confirm that it is not possible to make the GSA honor the declared language. This was annoying, as the heuristics were sometimes wrong, for example when some part of a page's content was in another language.

The spider component stumbled over some bugs in the web site we needed to index. We found that the same parameter got repeated over and over on a URL. Those cycles led to the same page being indexed many times and to the limit of 500,000 indexed pages filling up. This is of course a bug in the web site, but we found no way to help the GSA not to stumble over it.
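To make the failure mode concrete, here is a small Python sketch of how URLs with a parameter repeated over and over can be collapsed to one canonical form. It is not something the project did on the GSA; it only illustrates a generic server-side normalisation, and the function name and example URL are made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def collapse_repeated_params(url):
    """Keep only the first occurrence of each (name, value) query pair."""
    parts = urlsplit(url)
    seen = set()
    unique = []
    for pair in parse_qsl(parts.query, keep_blank_values=True):
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    # Rebuild the URL with the de-duplicated query string.
    return urlunsplit(parts._replace(query=urlencode(unique)))


print(collapse_repeated_params(
    "http://intranet.example/page?lang=de&lang=de&lang=de&id=7"
))
```

A site could redirect any request whose URL differs from its collapsed form, so that a crawler following the buggy links always ends up on the same address.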

Filtering by meta information would work. But we have binary documents like PDF, Word and so on, and there was no way to set the meta information for those documents. The parameter requiredfields=gsahintview:group1|-gsahintview should trigger a filter that says: either the meta information is present with a specific value, or there is no meta information at all. However, Google confirmed that this combination of filter expressions is not possible. They updated their documentation to at least explain the restrictions.
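The semantics we wanted from that filter can be written down as a small predicate. This Python sketch is only an illustration of the intended logic, not GSA behaviour or project code; the dictionary layout for a document's meta information is an assumption, and only the field name gsahintview comes from the query above.

```python
def visible_for_group(meta, group, field="gsahintview"):
    """True if the document lists `group` in `field`, or lacks the field entirely."""
    values = meta.get(field)
    if values is None:
        # No meta information at all, e.g. a PDF or Word document.
        return True
    return group in values


documents = {
    "news.html": {"gsahintview": ["group1", "group2"]},
    "internal.html": {"gsahintview": ["group2"]},
    "manual.pdf": {},
}
hits = [name for name, meta in documents.items() if visible_for_group(meta, "group1")]
```

With these made-up documents, `hits` contains the HTML page tagged for group1 and the untagged PDF, while the page tagged only for group2 is filtered out.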

The only thing that really worked without hassle was the search box. You can configure the GSA to request data from the web server and return an XML fragment that is integrated into the search result page.

Support by Google was a very positive aspect. They answered fast and without fuss, and were motivated to help. They seemed competent - so I guess when they did not propose alternatives but simply said there is no such feature, there really was no alternative for our feature requests.

GSA Hardware

The Google hardware however was a real nuisance. You get the appliance as a standard sized server to put into the rack. Having the hardware locally makes sense: it won't use external bandwidth for indexing and you can be more confident about the security of your confidential data. But during the 2 years we used the GSA, there were 3 hardware failures. As part of the setup test, our hoster checks whether a system recovers properly by unplugging it entirely. While this is not good for data of course, the hardware should survive that. The GSA did not and had to be sent in for repair. There were two more hardware issues - one was simply a RAM module signaling an error. But as the hoster is not allowed to open the box, even such a simple repair took quite a while. Our client did not want to buy more than one appliance for his system, as they are rather expensive, so there was no replacement ready. With any other server, the hoster can fix the system rather fast or in the worst case just re-install the system from backups. With the GSA there is no such fallback.

The GSA is not only closed at the hardware level. You also do not have shell access to the system, so all configuration has to be done in the web interface. Ver

Truncated by Planet PHP, read more at the original (another 6419 bytes)