Duplicates Detection with ElasticSearch
As I mentioned on Twitter recently, duplicate detection (or de-duping) is one of the most unappreciated problems that the developers of certain types of applications face sooner or later. The applications I'm talking about share two main characteristics: item collection and some sort of social aspect. A prime example of this is Delicious. Users collect items (bookmarks), and, since multiple people can collect the same item (URL), there needs to be duplicate detection, so that it is possible to infer some information from the data, such as how many people saved a certain URL, what are the popular URLs of the day, etc.
In the case of URL bookmarking, de-duping is fairly easy, because usually a URL points to a unique piece of content, so grouping by the URL gives the desired result. There are some complications arising from redirections and URL shorteners, of course, but the general case is not difficult.
De-duping content that does not have a unique identifier for each item is more challenging. We ran into this while working on Mapalong, and I wanted to share the approach that we took to detect duplicates in our data.
Mapalong allows people to bookmark their favorite locations, and a placemark (what we call it) consists of a number of fields such as title, notes, tags, URL, geo coordinates, and a few more. Mapalong is a map-oriented product, and one of the ways to add a placemark is to simply zoom and drag the map to the location you want to add. Predictably, people are not always precise when they add the placemarks this way. The fields that are likely to be inconsistent are the title (e.g., The Invisible Dog versus The Invisible Dog Art Center) and the location.
The title varies, because people often just type the name in rather than copy their placemark from a known business listing. This also means they can misspell the title, use different punctuation, or give the place a title that is special to them rather than the canonical one (e.g., My first kiss instead of Mulholland Drive overlook).
The location variance is mostly due to two reasons: people's recollection of the actual location may be wrong, and the places on the map occupy a certain area, so each person can bookmark it at a different point in that area (such as different entrances to the building, etc.). Extended places like the Google campus or a popular beach can contain multiple placemarks that are equally valid and should be treated as duplicates for some purposes.
I felt that the tags and notes fields could vary too much from one placemark to another to provide adequate matching, and decided to do duplicate detection based only on the title and location. The approach I took relies on ElasticSearch, which is a full-text search engine we use for Mapalong already. It has a powerful and fast query engine and supports geo-based search as well.
Obviously, matching the title exactly, even case-insensitively, was going to get me only halfway, due to the reasons indicated above. To deal with possible spelling errors, punctuation, and missing or extra words, I used the FuzzyLikeThis (FLT) query type, which is a combination of the MoreLikeThis query that retrieves documents similar to the given text and a Levenshtein-based fuzzy search.
To do the initial classification of places into duplicate groups, the basic algorithm is as follows:
- Loop through all the places.
- Use FLT + geo query to find places that are likely to be duplicates.
- Create a duplicate group from the current place and the matches.
- Mark these places as done, so that they don't get added to another group.
Once the groups are built, the subsequent duplicate matching is done only when a user adds a new place; then we simply run the FLT + geo query, determine which duplicate group contains the majority of the results, and add the new place to that group.
FuzzyLikeThis has several useful tuning parameters:max_query_terms Specifies the maximum number of terms that will be included in the generated query min_similarity Excludes the terms that are too dissimilar prefix_length Enforces a common prefix on the fuzzified terms (which greatly affects the speed of the Levenshtein algorithm)
I experimented with different values for max_query_terms, but since the majority of th
Truncated by Planet PHP, read more at the original (another 4466 bytes)