Harnessing the power of Levenshtein distance to help site search

  • 0
  • May 5, 2011
Douglas Radburn

Douglas Radburn

Head of Technical

Levenshtein distance is a metric for measuring the ‘difference’ between two sequences.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.

An example:

An example of Levenshtein distance can be shown below (taken from Wikipedia’s article on the subject).

  • kitten β†’ sitten (substitution of ‘s’ for ‘k’).
  • sitten β†’ sittin (substitution of ‘i’ for ‘e’).
  • sittin β†’ sitting (insertion of ‘g’ at the end).

So, what is it useful for?

Levenshtein has many practical implementations, but I’d like to concentrate on one that can help users find products or pages on your site.

If you have search functionality on your site, you probably also have some form of tracking that tells you what phrases are being searched for, their frequency, and perhaps snap-shots of what search results are returned. You might notice that there are lots of searches that return no results, and lots of searches that really SHOULD return results, but don’t. Without getting into too much detail, it all really depends on how your data is structured.

Say, for example, your search system is fairly basic, and searches titles, and perhaps descriptions of products when a visitor searches on your site. One of the products you sell might be Coca-Cola, however a visitor might search for something like “Coca Cola” or “CocaCola”, which would yield no results whatsoever (as they don’t match exactly against what you have in the database).

This is where Levenshtein can come in handy. Using the formula, you can take your visitor’s input and when the initial search returns no (or few) results, try and find the nearest match. This means that you can return products that match “Coca-Cola”, even though the visitor searched for “CocaCola”.

How obvious you want to make this, is upto you.

You could for example, simple return no results, but add a line that suggests an alternate search, eg. “Did you mean Coca-Cola?” and link to another search for that phrase. This is obviously going to improve relevant results for your users. They are going to see that “CocaCola” returned no results, and that “Coca-Cola” was the phrase they potentially should have been looking for.

Another way could be to inform the user that no results were found for their search, but reduce the need for any work on their part, and just display the correct results. This is the most effective way of providing more shaped, and relevant results for your users. You probably see this way of working everyday when you search using Google. If you mis-type a word, instead of pointing out your mistake, they’ll simply ‘assume’ you meant something else and provide you the (usually) correct results.

"Misspelt" - Google.co.uk

Okay, it’s useful, but how should I implement it?

Wikipedia provides some interesting pseudo-code that effectively gives an over-view of how the formula works, shown below:

There is no need to worry about having to create the function for this, as many programming languages already have built-in functions to do this. PHP has this covered through its function which is simply levenshtein. So, you can compare two words thus: $lev = levenshtein($word1, $word2);

I find it easier when using Levenshtein for searching, to match against a list of words, rather than trying to search and programmatically check through product details from a database. To make this easier you could add triggers to add/edit/delete products, or a cron task that runs every night or when your server load is at a minimum that will scrape the content from your products.

I use a function similar to the one below to clean up input.

This function takes in some data – the description of a product for example – and breaks it down. Firstly it removes any punctuation, and any other special characters you might want to ignore. Obviously, this should be amended to suit the specific implementation you’re dealing with. It then adds it onto a master list. This function is probably called in a loop from another function that is pulling results from a database.

Using this to build your list of words that are mentioned and relevant to your site means you also won’t be recommending words and phrases that might be spelt correctly, but aren’t actually mentioned anywhere!

You can then put together a simple function that you can run prior to searching your database. Like this:

This is a fairly simplistic example, and can of course be put awry by having misspellings in your actual descriptions/titles. You could easily expand this to add different weightings to whether you need to add letters or subtract letters to make a word, and then what preference you’d give each result before displaying it to your user.

PHP also has some other pretty cool functions that might better suit the environment that you’re implementing the search in. They are:

string soundex ( string $str ) – This function computes the ‘sound’ of a string. That value can then be used to compare the sound of one string against another.

string metaphone ( string $str [, int $phonemes = 0 ] ) – This function is very similar to soundex, however, it knows the basics of English pronunciation. This should probably only be considered if your search fucntionality covers only one language, and that language is English!

int similar_text ( string $first , string $second [, float &$percent ] ) – This function calculates how similar two strings are. Its return value is how many characters are the same in both strings. This may or may not be helpful in this situation as you’d need to add a large amount of logic to process how this would affect your results.

Free of charge. Unsubscribe anytime.