PhpRiot
News Archive
PhpRiot Newsletter
Your Email Address:

More information

Indexing Freeform-Tagged Data

Note: This article was originally published at Planet PHP on 26 June 2012.
Planet PHP

Indexing Freeform-Tagged Data

Dieren, Netherlands Tuesday, June 26th 2012, 09:11 CEST

At last week's MongoDB UK I presented on "Geolocation, Maps and MongoDB". During this talk I presented a few methods on how to store OpenStreetMap's tags on geographical objects. OpenStreetMap's tags are free form and for example for a road can look like:

alt_name: The Strand highway: primary name: Strand oneway: yes postal_code: WC2

For this article, I am using a data extra from OpenStreetMap for just London. There are about 2.2 million items in my database after importing taking up 413MB of storage (without indexes).

Tags as a normal object

The first method for storing into MongoDB basically uses a tags field with each of the OpenStreetMap tag taking up a property in the array:

"_id" : "w1572026939", "type" : NumberLong(2), "tags" : { "alt_name" : "The Strand", "highway" : "primary", "name" : "Strand", "oneway" : "yes", "postal_code" : "WC2" }, "loc" : [ [ -0.125262, 51.5086046 ], a, [ -0.1270609, 51.5076398 ] ]

In order to make looking up tags and values faster, we will need to create some indexes. We would like to look for streets by name:

db.poi.ensureIndex({ 'tags.name': 1 }); db.poi.find({ 'tags.name': 'Strand' });

And also find roads and paths:

db.poi.ensureIndex({ 'tags.highway': 1 }); db.poi.find({ 'tags.highway' : { $exists: 1 } });

As you can see, for every tag that we want to search on we would have to create an index. More indexes make inserts, updates and deletes slower and MongoDB is also limited to 64 indexes. In order to have an index on every tag we need to find another solution. Also note that the second query doesn't even use an index when you check it with calling explain().

Tags as key/value pairs

The second method I introduced was the one where instead of storing the tags as properties of the tags field, I stored objects containing a tag/tagvalue pair each:

"tags_indexed" : [ { "k" : "alt_name", "v" : "The Strand" }, { "k" : "highway", "v" : "primary" }, { "k" : "name", "v" : "Strand" }, { "k" : "oneway", "v" : "yes" }, { "k" : "postal_code", "v" : "WC2" } ],

The index we then can create is:

db.poi.ensureIndex({ 'tags_indexed.k' : 1, 'tags_indexed.v' : 1 });

To answer the same questions as above we have to rewrite the queries. To find all documents where the name is Strand we can not simply do:

db.poi.find({ 'tags_indexed.k' : 'name', 'tags_indexed.v' : 'Strand' });

This query will find all documents where either the key is name or the value is Strand. This includes documents where there is an array element like { "k" : "name", "v" : "Strand" } in the tags_indexed array, but also every other document where there is one "k" : "name" and a "v" : "Strand" element, such as in:

{ "_id" : ObjectId("4fe4c4b044670a41222088b5"), "tags_indexed" : [ { "k" : "addr:housenumber", "v" : "79" }, { "k" : "addr:postcode", "v" : "WC2R 0DW" }, { "k" : "addr:street", "v" : "Strand" }, { "k" : "building", "v" : "yes" }, { "k" : "name", "v" : "Stamp Centre" } ] }

In order to match only documents where there is one array element with both the "k" : "name" and "v" : "Strand" properties, you need to use $elemMatch:

db.poi.find({ 'tags_indexed' : { $elemMatch: { 'k' : 'name', 'v' : 'Strand' } } });

Let's look at the explain output of both variants. In the first case (with 'name': 'Strand'), the explain output looks like:

db.poi.find({ 'tags.name': 'Strand' }).explain(); { "cursor" : "BtreeCursor tags.name_1", "nscanned" : 12, "nscannedObjects" : 12, "n" : 12, "millis" : 1, "nYields" : 0, "nChunkSkips" : 0, "is

Truncated by Planet PHP, read more at the original (another 7358 bytes)