PhpRiot
News Archive
PhpRiot Newsletter
Your Email Address:

More information

Transforming end user queries to Solr

Note: This article was originally published at Planet PHP on 1 July 2010.
Planet PHP

A bit less than a year ago I last did a presentation about a telephone book application where we used SQL to do some fairly advanced filtering over about 30 tables of data. The app generated SQL statements that filled pages, the more terms the more pages, but on a 10k dataset it still came back within a few milliseconds, thanks to a ton of indexing and denormalization tricks (SQL Server is a lot more powerful here than MySQL) I had applied. Now in a more recent project I am dealing with 10M+ dataset running on MySQL and so decided to learn about Solr. Wow, that thing is amazing and way more flexible in terms of query language than I expected. As a result I do not see it any more for just projects that are too big for an RDBMS, but more as the way to do search in general. I have mentioned resolutionfinder.org a few times (used to be called UN-informed.org). Solr is a key piece there and more importantly I am looking to expand the use of Solr query language quite a bit. Actually for those who know, you can already do a lot more powerful queries, something Liip will be investing some more time to make more accessible to end users with some UI tweaks planned in July. But in this blog post I want to talk about a prototype class I threw together (Look ma', I'm using git!) by working ezcSearch to help me in parsing and transforming end user queries into complex Solr queries.

Right now we have 3 input boxes on the resolutionfinder.org start page. The first one lets user input terms which are searched in the document title and the clause content. Here I recently exposed some of the syntax supported by the dismax handler like "+" to require and "-" to prohibit terms. Next is an input box that doesn't directly affect the search filtering. What it does is use auto suggest to build up a list of tag to filter on. The third input box actually is a separate search form entirely which is used for searching for specific documents by document code. My vision is to bring these all together into one input box. So if someone wants to do a fulltext search they could do something like the following:

"security council" +africa tag:malaria code:A/RES/*

This query would be parsed and result in looking for documents that must contain "africa" in the document title or clause content (scoring those that also contain "security council" higher), which are tagged with "malaria" and who's document code started with "A/RES/". This requires quite a bit of transformation. Actually the final queries I want to construct looks like this:

q=tag_ids:23 AND document_code_prefix:(A/RES/) AND (_query_:"{!dismax qf='content document_title' pf='content document_title' mm=0 v=$qq}")&qq="security council" +africa

So what happened here? As you can see I transformed "tag" to "tag_ids" and "code" to "document_code_prefix". These are the internal fields I use in the Solr index. There is actually also a "document_code" field, which I would use if the code would not end with a "*". The "document_code_prefix" just copies the "document_code" and applies an ngram filter to make prefix searches super fast. Also you may notice that instead of "malaria" I am filtering for 23, this is because right now I just index the tag id's and not the names. The next bit is taking all the un-fielded filters and apply a dismax search handler on them. In order to not require fancy escaping I use "v=$qq" to define that the actual search terms are passed in the "qq" GET parameter.

With the above mentioned class this sort of thing becomes fairly simple. I am using the symfony sfSolrPlugin, which provides query construction tools via the sfLuceneCriteria object. All I have to do is use the parser to parse the end user query into tokens and then use a custom term class that uses the sfLuceneCriteria instance to handle the serialization.

= new sfLuceneCrit

Truncated by Planet PHP, read more at the original (another 14835 bytes)