
Optimizing String Matching Algorithm

function levenshtein(a, b) {
  var i,j,cost,d=[];

  if (a.length == 0) {return b.length;}
  if (b.length == 0) {return a.length;}

  for ( i = 0; i <= a.length; i++) {
    d[i] = new Array();
    d[ i ][0] = i;
  }

  for ( j = 0; j <= b.length; j++) {
    d[ 0 ][j] = j;
  }

  for ( i = 1; i <= a.length; i++) {
    for ( j = 1; j <= b.length; j++) {
      if (a.charAt(i - 1) == b.charAt(j - 1)) {
        cost = 0;
      } else {
        cost = 1;
      }

      d[ i ][j] = Math.min(d[ i - 1 ][j] + 1, d[ i ][j - 1] + 1, d[ i - 1 ][j - 1] + cost);

      if (i > 1 && j > 1 && a.charAt(i - 1) == b.charAt(j - 2) && a.charAt(i - 2) == b.charAt(j - 1)) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost)
      }
    }
  }

  return d[ a.length ][b.length];
}

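function suggests(suggWord) {
  var sArray = [];
  // Count down to index 0 so the first dictionary word is checked too.
  for (var z = words.length - 1; z >= 0; z--) {
    if (levenshtein(words[z], suggWord) < 2) {
      sArray.push(words[z]);
    }
  }
  return sArray;
}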

Hello.

I'm using the above implementation of the Damerau-Levenshtein algorithm. It's fast enough in a normal PC browser, but on a tablet it takes ~2/3 seconds.

Basically, I'm comparing the word sent to a suggest function against every word in my dictionary, and if the distance is less than 2, I add it to my array.

The dictionary is an array of approximately 600,000 words (699KB). The aim is to build a word-suggestion feature for my JavaScript spell checker.

Any suggestions on how to speed this up? Or a different way of doing this?

One thing you can do if you are only looking for distances less than some threshold is to compare the lengths first. For example, if you only want distances less than 2, then the absolute difference between the two strings' lengths must be less than 2 as well. Doing this will often let you skip the more expensive Levenshtein calculation entirely.

The reasoning behind this is that two strings that differ in length by 2 require at least two insertions or deletions (and thus have a minimum distance of 2).

You could modify your code as follows:

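function suggests(suggWord) {
  var sArray = [];
  // Count down to index 0 so the first dictionary word is checked too.
  for (var z = words.length - 1; z >= 0; z--) {
    // Cheap length check first; only then pay for the distance calculation.
    if (Math.abs(suggWord.length - words[z].length) < 2) {
      if (levenshtein(words[z], suggWord) < 2) {
        sArray.push(words[z]);
      }
    }
  }
  return sArray;
}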

I don't write much JavaScript, but I think this is how you could do it.

Part of the problem is that you have a large array of dictionary words and are doing at least some processing for every one of them. One idea would be to keep a separate array for each word length and organize your dictionary words into those instead of one big array (or, if you must have the one big array, for alphabetical lookups or whatever, then use arrays of indexes into that big array). Then, if you have a suggWord that's 5 characters long, you only have to look through the arrays of 4, 5, and 6 letter words. You can then remove the Math.abs(length - length) test in my code above, because you know you are only looking at words of a length that could match. This saves you from having to do anything at all with a large chunk of your dictionary words.
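As a rough sketch of that bucketing idea (the helper names bucketByLength and candidatesByLength are made up for illustration, and the buckets are assumed to be built once when the dictionary loads):

```javascript
// Group dictionary words by length so a lookup only touches nearby lengths.
// bucketByLength and candidatesByLength are hypothetical helper names.
function bucketByLength(words) {
  var buckets = {};
  for (var i = 0; i < words.length; i++) {
    var len = words[i].length;
    if (!buckets[len]) {
      buckets[len] = [];
    }
    buckets[len].push(words[i]);
  }
  return buckets;
}

// Collect the words whose length is within maxDist of the input's length.
function candidatesByLength(buckets, suggWord, maxDist) {
  var result = [];
  var from = suggWord.length - maxDist;
  var to = suggWord.length + maxDist;
  for (var len = from; len <= to; len++) {
    if (buckets[len]) {
      result = result.concat(buckets[len]);
    }
  }
  return result;
}
```

With a threshold of "distance less than 2" you would call candidatesByLength(buckets, suggWord, 1) and run levenshtein only on what it returns.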

Levenshtein is relatively expensive, and more so with longer words. If Levenshtein is simply too expensive to run that many times, especially on longer words, you can take advantage of another side effect of your threshold, which only accepts words that either match exactly or have a distance of 1 (one insertion, deletion, substitution, or transposition). Given that requirement, you can further filter candidates for the Levenshtein calculation by checking that either their first character matches or their last character matches (unless either word has a length of 1 or 2, in which case Levenshtein should be cheap to do). In fact, you could check for a match of either the first n characters or the last n characters, where n = (suggWord.length-1)/2, rounded down. If they don't pass that test, you can assume that they won't match via Levenshtein.

For this you would want the primary array of dictionary words ordered alphabetically and, in addition, an array of indexes into that array, ordered alphabetically by the words' reversed characters. Then you could do a binary search into both of those arrays, and only have to do the Levenshtein calculation on the small subset of words whose first or last n characters match the start or end of suggWord, and whose length differs by at most one character.
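Here is a hedged sketch of that prefix/suffix lookup. It simplifies the idea by keeping a second sorted array of reversed strings rather than an array of indexes, and all function names (lowerBound, withPrefix, reverse, affixCandidates) are invented for the example. Both arrays are assumed to be sorted with the default Array.prototype.sort order.

```javascript
// Lower-bound binary search: first index in sorted whose value is >= key.
function lowerBound(sorted, key) {
  var lo = 0;
  var hi = sorted.length;
  while (lo < hi) {
    var mid = (lo + hi) >>> 1;
    if (sorted[mid] < key) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}

// All entries of a sorted array that start with prefix (they are contiguous).
function withPrefix(sorted, prefix) {
  var out = [];
  var i = lowerBound(sorted, prefix);
  while (i < sorted.length && sorted[i].indexOf(prefix) === 0) {
    out.push(sorted[i]);
    i++;
  }
  return out;
}

function reverse(s) {
  return s.split("").reverse().join("");
}

// Words sharing the first n or last n characters with suggWord and whose
// length differs by at most 1. Only these need the full distance check.
function affixCandidates(sortedWords, sortedReversed, suggWord) {
  var n = Math.floor((suggWord.length - 1) / 2);
  var seen = Object.create(null); // no inherited keys such as "constructor"
  var out = [];
  function add(word) {
    if (!seen[word] && Math.abs(word.length - suggWord.length) <= 1) {
      seen[word] = true;
      out.push(word);
    }
  }
  withPrefix(sortedWords, suggWord.slice(0, n)).forEach(add);
  withPrefix(sortedReversed, reverse(suggWord).slice(0, n)).forEach(function (r) {
    add(reverse(r));
  });
  return out;
}
```

For inputs of length 1 or 2, n is 0 and the prefix is empty, so every word of a suitable length comes back, which matches the remark that short words can go straight to Levenshtein.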

I had to optimize the same algorithm. What worked best for me was to cache the d array: you create it once, at a large size (the maximum string length you expect), outside of the levenshtein function, so you don't have to reinitialize it each time you call the function.

In my case, in Ruby, it made a huge difference in performance. But of course it depends on the size of your words array.

function levenshtein(a, b, d) {
  // d is a preallocated matrix whose first row and column are already filled in.
  var i, j, cost;

  if (a.length == 0) {return b.length;}
  if (b.length == 0) {return a.length;}

  for (i = 1; i <= a.length; i++) {
    for (j = 1; j <= b.length; j++) {
      if (a.charAt(i - 1) == b.charAt(j - 1)) {
        cost = 0;
      } else {
        cost = 1;
      }

      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);

      if (i > 1 && j > 1 && a.charAt(i - 1) == b.charAt(j - 2) && a.charAt(i - 2) == b.charAt(j - 1)) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost);
      }
    }
  }

  return d[a.length][b.length];
}

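function suggests(suggWord) {
  var i, j, z;
  var d = [];

  // Build the matrix once per call; 999 is the assumed maximum word length.
  for (i = 0; i <= 999; i++) {
    d[i] = [];
    d[i][0] = i;
  }
  for (j = 0; j <= 999; j++) {
    d[0][j] = j;
  }

  var sArray = [];
  // Count down to index 0 so the first dictionary word is checked too.
  for (z = words.length - 1; z >= 0; z--) {
    if (levenshtein(words[z], suggWord, d) < 2) {
      sArray.push(words[z]);
    }
  }
  return sArray;
}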

You should store all the words in a trie. This is space-efficient compared with storing the words in a flat dictionary, because words that share a prefix share the nodes for it. To look up a word, you traverse the trie one character at a time and check whether the node you end on is marked as the end of a word.
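A minimal trie for this kind of dictionary might look like the sketch below. The node shape and the names createNode, insertWord and containsWord are assumptions made for the example, not part of any library.

```javascript
// Each node maps a character to a child node; isWord marks nodes where a
// dictionary word ends. All names here are hypothetical.
function createNode() {
  return { children: Object.create(null), isWord: false };
}

function insertWord(root, word) {
  var node = root;
  for (var i = 0; i < word.length; i++) {
    var ch = word.charAt(i);
    if (!node.children[ch]) {
      node.children[ch] = createNode();
    }
    node = node.children[ch];
  }
  node.isWord = true;
}

function containsWord(root, word) {
  var node = root;
  for (var i = 0; i < word.length; i++) {
    node = node.children[word.charAt(i)];
    if (!node) {
      return false; // no dictionary word continues with this character
    }
  }
  return node.isWord;
}
```

You would build the trie once from the words array with insertWord and then use containsWord for exact lookups.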

Edit

As I mentioned in my comment, for a Levenshtein distance of 0 or 1 you don't need to go through all the words. Two words have a Levenshtein distance of 0 if they are equal. So the problem boils down to predicting all the words that have a Levenshtein distance of 1 from a given word. Let's take an example:

array

For the above word, if you want to find words at a Levenshtein distance of 1, the examples will be:

  • parray, aprray, arpray, arrpay, arrapy, arrayp (insertion of a character)

Here p can be replaced by any other letter.

These words also have a Levenshtein distance of 1:

  • rray, aray, arry, arra (deletion of a character)

And finally these words:

  • prray, apray, arpay, arrpy and arrap (substitution of a character)

Here again, p can be replaced by any other letter.

So if you look up only these particular combinations rather than all the words, you will get to your solution. If you know how the Levenshtein algorithm works, this is effectively reverse engineering it.
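The enumeration described above could be sketched roughly as follows. It assumes a lowercase a-z alphabet and a dictionary stored as an object whose keys are the words (both are assumptions, as are the names editsWithinOne and suggestFromSet). It also generates adjacent transpositions, since the question uses the Damerau variant.

```javascript
// Generate every string reachable from word by one insertion, deletion,
// substitution, or adjacent transposition. The word itself can also appear
// (for example when a letter is substituted by itself), which is distance 0.
function editsWithinOne(word, alphabet) {
  var results = Object.create(null);
  for (var i = 0; i <= word.length; i++) {
    var head = word.slice(0, i);
    var tail = word.slice(i);
    if (tail.length > 0) {
      results[head + tail.slice(1)] = true; // deletion
    }
    if (tail.length > 1) {
      // transposition of the two characters at positions i and i + 1
      results[head + tail.charAt(1) + tail.charAt(0) + tail.slice(2)] = true;
    }
    for (var c = 0; c < alphabet.length; c++) {
      var ch = alphabet.charAt(c);
      results[head + ch + tail] = true; // insertion
      if (tail.length > 0) {
        results[head + ch + tail.slice(1)] = true; // substitution
      }
    }
  }
  return Object.keys(results);
}

// dictionary is assumed to be an object with each known word set to true.
function suggestFromSet(dictionary, suggWord) {
  var alphabet = "abcdefghijklmnopqrstuvwxyz"; // assumed alphabet
  return editsWithinOne(suggWord, alphabet).filter(function (candidate) {
    return dictionary[candidate] === true;
  });
}
```

For a 5-letter word such as array this produces roughly 300 candidate strings before deduplication (156 insertions, 130 substitutions, 5 deletions, 4 transpositions), so the work per lookup no longer depends on the size of the dictionary.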

A final example, which is your use case:

Suppose pary is the word you get as input, and it should be corrected to part from the dictionary. For pary you don't need to look at words starting with ab, for example, because any word starting with ab will have a Levenshtein distance greater than 1.
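One way to get that kind of pruning is to walk a trie while carrying one row of the Levenshtein matrix per node, and to stop descending as soon as every entry in the row exceeds the threshold. The sketch below does this for plain Levenshtein only (no transpositions, which would need the row before the previous one as well). The node layout and the names createNode, insertWord and searchWithin are assumptions for the example.

```javascript
// Trie node: children by character, plus the full word on terminal nodes.
function createNode() {
  return { children: Object.create(null), word: null };
}

function insertWord(root, word) {
  var node = root;
  for (var i = 0; i < word.length; i++) {
    var ch = word.charAt(i);
    if (!node.children[ch]) {
      node.children[ch] = createNode();
    }
    node = node.children[ch];
  }
  node.word = word;
}

// Return all words in the trie within maxDist (plain Levenshtein) of target.
function searchWithin(root, target, maxDist) {
  var results = [];
  var firstRow = [];
  for (var j = 0; j <= target.length; j++) {
    firstRow.push(j); // distance from the empty prefix to target[0..j)
  }

  function visit(node, ch, prevRow) {
    var row = [prevRow[0] + 1];
    var best = row[0];
    for (var j = 1; j <= target.length; j++) {
      var cost = target.charAt(j - 1) === ch ? 0 : 1;
      row[j] = Math.min(row[j - 1] + 1, prevRow[j] + 1, prevRow[j - 1] + cost);
      if (row[j] < best) {
        best = row[j];
      }
    }
    if (node.word !== null && row[target.length] <= maxDist) {
      results.push(node.word);
    }
    // Prune: if no entry is within maxDist, no longer word on this branch can be.
    if (best <= maxDist) {
      for (var next in node.children) {
        visit(node.children[next], next, row);
      }
    }
  }

  for (var first in root.children) {
    visit(root.children[first], first, firstRow);
  }
  return results;
}
```

With pary as the target and a threshold of 1, the branch for ab is abandoned after two characters, so none of the words below it are ever examined.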

There are some simple things you can do in your code to radically improve execution speed. I completely rewrote your code for performance, for type stability that suits JIT compilation, and for JSLint compliance:

var levenshtein = function (a, b) {
        "use strict";
        var i = 0,
            j = 0,
            cost = 1,
            d = [],
            x = a.length,
            y = b.length,
            ai = "",
            bj = "",
            xx = x + 1,
            yy = y + 1;
        if (x === 0) {
            return y;
        }
        if (y === 0) {
            return x;
        }
        for (i = 0; i < xx; i += 1) {
            d[i] = [];
            d[i][0] = i;
        }
        for (j = 0; j < yy; j += 1) {
            d[0][j] = j;
        }
        for (i = 1; i < xx; i += 1) {
            for (j = 1; j < yy; j += 1) {
                ai = a.charAt(i - 1);
                bj = b.charAt(j - 1);
                if (ai === bj) {
                    cost = 0;
                } else {
                    cost = 1;
                }
                d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
                if (i > 1 && j > 1 && ai === b.charAt(j - 2) && a.charAt(i - 2) === bj) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + cost);
                }
            }
        }
        return d[x][y];
    };

Looking up the length of the strings on every iteration of the nested loops is costly, so I cache the lengths in x and y. I also beautified your code using http://prettydiff.com/ so that I could read it in half the time, and I removed some redundant lookups in your strings. Please let me know if this executes faster for you.

