Friday, February 13, 2009

Levenshtein algorithm revisited: 2.5 times faster

First of all one consideration:

JavaScript is SLOW!


Chrome, Safari, Opera, FireFox (Internet Explorer obviously is the only one 2 to 8 times slower!)
It does not matter which tracemonkey or V8 you are using, JS is far away to be fast as C#, C++, C are!

About Levenshtein


If you would like to create a project like BJSpell with a reasonable suggestion, the levenshtein algorithm will trill at some point: it calculates the distance between 1 string to another one. It means that from the word "Gofy" to the word "Goofy", this algo will return 1 character to change replace, or add, before Gofy will become Goofy.

Why Levenshtein


To obtain a suggestion that makes sense, we could use some known algorithm to recognize at least words close to the mistyped one. So far, my suggestion in BJSpell was something truly rude and often useless, but as far as I know there is NO WAY to implement something more clever, using every possible best performances practice.

How to speed up the algorithm


Every "bloody" implementation of this algorithm uses 1 more useless loop to pre-assign values to the multidimensional Array, plus a boring if else instead of ternary operator.
Removing those loops I switched performances from 1.680 seconds to 560, over 100000 computations using C#. The trick is simple: set the Array when you need it!
Here is the code I wrote for JavaScript first and C# after:

// JavaScript
var levenshtein = function(min, split){
// Levenshtein Algorithm Revisited - WebReflection
try{split=!("0")[0]}catch(i){split=true};
return function(a, b){
if(a == b)return 0;
if(!a.length || !b.length)return b.length || a.length;
if(split){a = a.split("");b = b.split("")};
var len1 = a.length + 1,
len2 = b.length + 1,
I = 0,
i = 0,
d = [[0]],
c, j, J;
while(++i < len2)
d[0][i] = i;
i = 0;
while(++i < len1){
J = j = 0;
c = a[I];
d[i] = [i];
while(++j < len2){
d[i][j] = min(d[I][j] + 1, d[i][J] + 1, d[I][J] + (c != b[J]));
++J;
};
++I;
};
return d[len1 - 1][len2 - 1];
}
}(Math.min, false);


// C#
public int levenshtein(string a, string b){
// Levenshtein Algorithm Revisited - WebReflection
if (a == b)
return 0;
if (a.Length == 0 || b.Length == 0)
return a.Length == 0 ? b.Length : a.Length;
int len1 = a.Length + 1,
len2 = b.Length + 1,
I = 0,
i = 0,
c, j, J;
int[,] d = new int[len1, len2];
while(i < len2)
d[0, i] = i++;
i = 0;
while(++i < len1){
J = j = 0;
c = a[I];
d[i, 0] = i;
while(++j < len2){
d[i, j] = Math.Min(Math.Min(d[I, j] + 1, d[i, J] + 1), d[I, J] + (c == b[J] ? 0 : 1));
++J;
};
++I;
};
return d[len1 - 1, len2 - 1];
}


As Summary


Levenshtein algorithm is still too slow if performed via JavaScript but for small dictionaries (up to 5000 words instead of 70000 or more as I tested with the entire en_US vocabulary) is a perfect choice for suggestions.

Have fun with algorithms ... next one? Who knows, probably the soundex :D

No comments:

Post a Comment