Fulltext Search with MongoDB and Solr

Note: This article was originally published at Planet PHP on 6 June 2012.

Berlin, Germany Tuesday, June 5th 2012, 18:03 CEST

In this article I explain how you can tie MongoDB and Solr together. A small helper script in PHP uses MongoDB's replication features to generate updates in Solr automatically. We will first look at the MongoDB side and then the Solr side.

Replication

MongoDB's replication works by recording all operations done on a database in a log, called the oplog. The local database contains a collection called oplog.rs that stores all those operations. The oplog is a capped collection, which means it has a fixed maximum size. The operations stored in the oplog are easy to read, as it is just a normal collection in a normal database:

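<?php
// The start of this statement was lost in extraction: the Mongo class name
// and the connection string are assumptions. Only the replSet option is
// confirmed by the text.
$m = new Mongo('mongodb://localhost', array('replSet' => 'seta'));

// The oplog lives in the "local" database, in the "oplog.rs" collection.
$c = $m->local->selectCollection('oplog.rs');

// Fetch one insert operation recorded for demo.article.
$r = $c->findOne(array('ns' => 'demo.article', 'op' => 'i'));
var_dump($r);
?>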

This script connects to MongoDB and enables a replica set connection (array('replSet' => 'seta')). It then selects the local database and the oplog.rs collection, and fetches a single document with findOne().

The ns field can be used to restrict the operations you see to particular databases and collections. In this case, the findOne() call limits the operations to the demo database and the article collection.
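As a small illustration of how the ns value is put together, the following sketch splits it into its database and collection parts. The function name splitNamespace() is hypothetical and not part of the MongoDB driver; it only assumes that ns has the form database.collection.

```php
<?php
// Hypothetical helper (not part of the MongoDB driver): split an oplog
// 'ns' value into its database and collection parts.
// Only the first dot separates the two, because collection names may
// contain dots themselves (for example "local.oplog.rs").
function splitNamespace($ns)
{
    $parts = explode('.', $ns, 2);
    if (count($parts) !== 2 || $parts[0] === '' || $parts[1] === '') {
        throw new InvalidArgumentException("Not a database.collection namespace: $ns");
    }
    return array('db' => $parts[0], 'collection' => $parts[1]);
}
```

Calling splitNamespace('demo.article') yields the database demo and the collection article, which is the pair the findOne() call above filters on.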

The op field tells you what sort of operation was recorded. In some cases, the records in the oplog do not match what you sent through the driver, because an operation is broken up into parts. A single remove() call, for example, generates one delete operation in the oplog for each document it removes. The op types are: i for inserts, u for updates and d for deletes. n is used for notices, such as the reconfiguration of the replica set.
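The op codes listed above can be turned into readable names with a simple lookup. This is only a sketch: describeOp() is a hypothetical helper, and any code that is not mentioned in this article is reported as "unknown" rather than guessed at.

```php
<?php
// Hypothetical helper: map the one-letter 'op' code of an oplog entry to
// a readable name. Only the four codes described in the text are mapped.
function describeOp($op)
{
    $names = array(
        'i' => 'insert',
        'u' => 'update',
        'd' => 'delete',
        'n' => 'notice',
    );
    return isset($names[$op]) ? $names[$op] : 'unknown';
}
```

A script that reads the oplog could use such a lookup to decide which entries to act on and which to skip.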

When running the script above, you get an output like:

array(5) {
  'ts' =>
  class MongoTimestamp#6 (2) {
    public $sec =>
    int(1338716530)
    public $inc =>
    int(1)
  }
  'h' =>
  int(5050582189860592111)
  'op' =>
  string(1) "i"
  'ns' =>
  string(12) "demo.article"
  'o' =>
  array(3) {
    '_id' =>
    string(8) "w4442243"
    'name' =>
    string(16) "Brondesbury Road"
    'tags' =>
    array(3) {
      'highway' =>
      string(9) "secondary"
      'ref' =>
      string(4) "B451"
      'source_ref' =>
      string(22) "OS OpenData StreetView"
    }
  }
}

And an update operation (type u) looks like:

array(6) {
  'ts' =>
  class MongoTimestamp#6 (2) {
    public $sec =>
    int(1338716534)
    public $inc =>
    int(1)
  }
  'h' =>
  int(644574934620412462)
  'op' =>
  string(1) "u"
  'ns' =>
  string(12) "demo.article"
  'o2' =>
  array(1) {
    '_id' =>
    string(8) "w4442243"
  }
  'o' =>
  array(1) {
    '$set' =>
    array(1) {
      'source_ref' =>
      string(6) "survey"
    }
  }
}
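The two dumps differ in where the document's _id lives: an insert carries the whole document in o, while an update carries the selector in o2 and the change in o. The sketch below pulls both cases into one shape. summariseEntry() is a hypothetical helper, and it handles only the two entry layouts shown in this article; everything else returns null.

```php
<?php
// Hypothetical helper: reduce an oplog entry (as a plain PHP array) to the
// document _id, the kind of operation, and the payload.
function summariseEntry(array $entry)
{
    switch ($entry['op']) {
        case 'i':
            // Inserts: 'o' holds the complete new document, including _id.
            return array(
                'id'   => $entry['o']['_id'],
                'type' => 'insert',
                'data' => $entry['o'],
            );
        case 'u':
            // Updates: 'o2' selects the document, 'o' describes the change.
            return array(
                'id'   => $entry['o2']['_id'],
                'type' => 'update',
                'data' => $entry['o'],
            );
        default:
            // Other entry types are not covered by this sketch.
            return null;
    }
}
```

For the update entry above, this gives the id w4442243 together with the $set modifier, which is the information a script would need to know which document changed and how.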

Tailable Cursors

One of the cool things about capped collections is that they support something called a tailable cursor. A tailable cursor does not use an index and can only return documents in natural order, which is the order in which documents were inserted into the collection. Hence, we can write the following PHP script to connect to the oplog and wait until new operations are made:

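<?php
// The start of this statement was lost in extraction: the Mongo class name
// and the connection string are assumptions. Only the replSet option is
// confirmed by the text.
$m = new Mongo('mongodb://localhost', array('replSet' => 'seta'));
$c = $m->local->selectCollection('oplog.rs');

// Select all operations on demo.article and keep the cursor open at the end.
$cursor = $c->find(array('ns' => 'demo.article'));
$cursor->tailable(true);

while (true) {
    if (!$cursor->hasNext()) {
        // we've read all the results, exit
        if ($cursor->dead()) {
            break;
        }
        // Nothing new yet: wait a second before checking again.
        sleep(1);
    } else {
        var_dump($cursor->getNext());
    }
}
?>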

Truncated by Planet PHP, read more at the original (another 9517 bytes)