Speeding up Cython Damerau-Levenshtein

Question

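Based on the Wikipedia article, I have the following Cython implementation for computing the Damerau-Levenshtein distance between two strings, but it is currently too slow for my needs. I have a list of about 600,000 strings, and I have to find typos in that list.

I would be glad if anyone could suggest algorithmic improvements or some Python/Cython magic that reduces the script's run time. I don't really care how much space it uses, only how long the computation takes.

Profiling the script with about 2,000 strings shows that it spends 80% of the total run time (24 of 30 seconds) in the damerauLevenshteinDistance function, and I have no idea how to make it faster.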

def damerauLevenshteinDistance(a, b, h):
    """
    a = source sequence
    b = comparing sequence
    h = matrix to store the metrics (currently nested list)
    """
    cdef int inf,lena,lenb,i,j,x,i1,j1,d,db
    alphabet = getAlphabet((a,b))
    lena = len(a)
    lenb = len(b)
    inf = lena + lenb + 1
    da = [0 for x in xrange(0, len(alphabet))]
    for i in xrange(1, lena+1):
        db = 0
        for j in xrange(1, lenb+1):
            i1 = da[alphabet[b[j-1]]]
            j1 = db
            d = 1
            if (a[i-1] == b[j-1]):
                d = 0
                db = j
            h[i+1][j+1] = min(
                h[i][j]+d,
                h[i+1][j]+1,
                h[i][j+1]+1,
                h[i1][j1]+(i-i1-1)+1+(j-j1-1)
            )
        da[alphabet[a[i-1]]] = i
    return h[lena+1][lenb+1]

cdef getAlphabet(words):
    """
    construct an alphabet out of the lists found in the tuple words with a
    sequential identifier for each word
    """
    cdef int i
    alphabet = {}
    i = 0
    for wordList in words:
        for letter in wordList:
            if letter not in alphabet:
                alphabet[letter] = i
                i += 1
    return alphabet

Answer 1

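If your search returns the same words more than once (that is, if you need to compute the Damerau-Levenshtein distance several times for the same input string values), consider using a dictionary (or hash map) to cache the results. Here is an implementation in C#: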

    private static Dictionary<int, Dictionary<int, int>> DamerauLevenshteinDictionary = new Dictionary<int, Dictionary<int, int>>();

    public static int DamerauLevenshteinDistanceWithDictionaryCaching(string word1, string word2)
    {
        Dictionary<int, int> word1Dictionary;

        if (DamerauLevenshteinDictionary.TryGetValue(word1.GetHashCode(), out word1Dictionary))
        {
            int distance;

            if (word1Dictionary.TryGetValue(word2.GetHashCode(), out distance))
            {
                // The distance is already in the dictionary
                return distance;
            }
            else
            {
                // The word1 has been found in the dictionary, but the matching with word2 hasn't been found.
                distance = DamerauLevenshteinDistance(word1, word2);
                DamerauLevenshteinDictionary[word1.GetHashCode()].Add(word2.GetHashCode(), distance);
                return distance;
            }
        }
        else
        {
            // The word1 hasn't been found in the dictionary, we must add an entry to the dictionary with that match.
            int distance = DamerauLevenshteinDistance(word1, word2);
            Dictionary<int, int> dictionaryToAdd = new Dictionary<int,int>();
            dictionaryToAdd.Add(word2.GetHashCode(), distance);
            DamerauLevenshteinDictionary.Add(word1.GetHashCode(), dictionaryToAdd);
            return distance;
        }
    }

Answer 2

I recently open-sourced a Cython implementation of the Damerau-Levenshtein algorithm. I include both the pyx and the C source.

https://github.com/gfairchild/pyxDamerauLevenshtein

Answer 3

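At least for longer strings, you should get better performance from a different algorithm that doesn't have to calculate every value in the lena⋅lenb matrix. For example, it is often unnecessary to calculate the exact cost of the [lena][0] corner of the matrix, which represents the cost of starting by deleting all the characters in a.

A better algorithm might be to always look at the point with the lowest weight calculated so far and then go one step further in all directions from there. This way you may reach the target position without examining every position in the matrix.

An implementation of this algorithm could use a priority queue and look like this: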

from heapq import heappop, heappush

def distance(a, b):
   pq = [(0,0,0)]
   lena = len(a)
   lenb = len(b)
   while True:
      (wgh, i, j) = heappop(pq)
      if i == lena and j == lenb:
         return wgh
      if i < lena:
         # deleted
         heappush(pq, (wgh+1, i+1, j))
      if j < lenb:
         # inserted
         heappush(pq, (wgh+1, i, j+1))
      if i < lena and j < lenb:
         if a[i] == b[j]:
            # unchanged
            heappush(pq, (wgh, i+1, j+1))
         else:
            # changed
            heappush(pq, (wgh+1, i+1, j+1))
      # ... more possibilities for changes, like your "+(i-i1-1)+1+(j-j1-1)"

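This is just a rough implementation; it can be improved a lot:

When adding new coordinates to the queue, check:
- if the coordinates have already been processed, don't add them again
- if the coordinates are currently in the queue, keep only the instance with the better weight
Use a priority queue implemented in C instead of the heapq module.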
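
As a minimal sketch of the first two improvements, the variant below keeps a set of already settled coordinates and discards stale queue entries when they are popped, which is a common way to get the same effect with heapq (it has no decrease-key operation). The done set and the transposition rule are assumptions added for illustration: the transposition step only swaps two adjacent characters, so this computes the restricted (optimal string alignment) distance, not the full Damerau-Levenshtein distance from the question.

```python
from heapq import heappop, heappush


def distance(a, b):
    """Best-first search over the edit grid, skipping cells already settled."""
    lena, lenb = len(a), len(b)
    pq = [(0, 0, 0)]   # entries are (weight, i, j)
    done = set()       # cells whose final weight is already known
    while pq:
        wgh, i, j = heappop(pq)
        if (i, j) in done:
            continue   # a copy with an equal or lower weight was handled earlier
        done.add((i, j))
        if i == lena and j == lenb:
            return wgh
        if i < lena:
            heappush(pq, (wgh + 1, i + 1, j))         # deleted
        if j < lenb:
            heappush(pq, (wgh + 1, i, j + 1))         # inserted
        if i < lena and j < lenb:
            step = 0 if a[i] == b[j] else 1
            heappush(pq, (wgh + step, i + 1, j + 1))  # unchanged or changed
        if (i + 1 < lena and j + 1 < lenb
                and a[i] == b[j + 1] and a[i + 1] == b[j]):
            heappush(pq, (wgh + 1, i + 2, j + 2))     # adjacent transposition
    raise AssertionError("unreachable: the end cell can always be reached")
```

For example, distance("ab", "ba") is 1 here, while distance("ca", "abc") is 3, where the unrestricted Damerau-Levenshtein distance would be 2.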

Answer 4

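It looks like you could statically type more of the code than you currently do, which should improve speed.

You could also look at this Cython implementation of Levenshtein distance as an example: http://hackmap.blogspot.com/2008/04/levenshtein-in-cython.html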

Answer 5

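My guess is that the biggest improvement to your current code would come from using a C array instead of a list of lists for the h matrix.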

Answer 6

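Run it through "cython -a", which gives you an annotated HTML version of the source with lines highlighted in yellow. The darker the colour, the more Python operations happen on that line. This usually helps to find time-consuming object conversions and the like.

However, I'm pretty sure that the biggest problem is your data structure. Consider using NumPy arrays instead of nested lists, or just a dynamically allocated block of C memory.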
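
To make the data-structure suggestion concrete, here is a rough sketch of the question's recurrence with the nested list replaced by one flat integer buffer indexed as row * width + column. It is written in plain Python with the standard array module so that it runs as-is; the function name damerau_levenshtein, the border initialisation (taken from the Wikipedia pseudocode, since the question does not show how h is set up) and the dict used for da are assumptions. In Cython the same indexing scheme could be applied to a typed buffer, which is where the actual speed-up would be expected to come from.

```python
from array import array


def damerau_levenshtein(a, b):
    """Unrestricted Damerau-Levenshtein distance using one flat integer buffer."""
    lena, lenb = len(a), len(b)
    inf = lena + lenb + 1
    width = lenb + 2  # row length of the (lena + 2) x (lenb + 2) table
    h = array("i", [0]) * ((lena + 2) * width)

    # Border rows and columns, following the Wikipedia pseudocode.
    h[0] = inf
    for i in range(lena + 1):
        h[(i + 1) * width] = inf      # h[i+1][0]
        h[(i + 1) * width + 1] = i    # h[i+1][1]
    for j in range(lenb + 1):
        h[j + 1] = inf                # h[0][j+1]
        h[width + j + 1] = j          # h[1][j+1]

    da = {}  # last row (1-based) in which each symbol of a was seen
    for i in range(1, lena + 1):
        db = 0  # last column in this row where a[i-1] matched b
        for j in range(1, lenb + 1):
            i1 = da.get(b[j - 1], 0)
            j1 = db
            d = 1
            if a[i - 1] == b[j - 1]:
                d = 0
                db = j
            h[(i + 1) * width + j + 1] = min(
                h[i * width + j] + d,            # substitution or match
                h[(i + 1) * width + j] + 1,      # insertion
                h[i * width + j + 1] + 1,        # deletion
                h[i1 * width + j1] + (i - i1 - 1) + 1 + (j - j1 - 1),  # transposition
            )
        da[a[i - 1]] = i
    return h[(lena + 1) * width + lenb + 1]
```

Because the buffer is allocated inside the function, nothing has to be passed in as h; when many pairs are compared, reusing one sufficiently large buffer across calls might avoid repeated allocation.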

6 answers

Answer 1
1 2011-10-14 15:30:11

Answer 2
0 2013-07-11 20:05:18

Answer 3
0 accepted 2011-04-07 13:21:10

Answer 4
0 2011-04-07 13:23:19

Answer 5
0 2011-04-07 13:27:36

Answer 6
0 2011-04-16 15:03:10
