Is it possible that the SequenceMatcher in Python's difflib could provide a more efficient way to calculate Levenshtein distance?

Here's the textbook example of the general algorithm to calculate Levenshtein distance (I've pulled it from Magnus Hetland's website):

def levenshtein(a,b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n

    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)

    return current[n]

I was wondering, however, if there might be a more efficient (and potentially more elegant) pure Python implementation that uses difflib's SequenceMatcher. After playing around with it, here's what I came up with:

from difflib import SequenceMatcher as sm

def lev_using_difflib(s1, s2):
    a = b = size = distance = 0
    for m in sm(a=s1, b=s2).get_matching_blocks():
        distance += max(m.a-a, m.b-b) - size
        a, b, size = m
    return distance
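
To make the loop above easier to follow, here is a minimal sketch that prints the triples `get_matching_blocks()` yields for one of the test pairs used in the timings. It assumes Python 3 (the question itself uses Python 2.7.3), and the variable names `s1`, `s2`, and `blocks` are only illustrative.

```python
from difflib import SequenceMatcher

s1, s2 = "sunday", "saturday"
blocks = SequenceMatcher(a=s1, b=s2).get_matching_blocks()

for block in blocks:
    # block.a and block.b are start offsets into s1 and s2;
    # block.size is the length of the run that matches at those offsets.
    print(block, repr(s1[block.a:block.a + block.size]))

# Per the difflib documentation, the last triple is a dummy
# with the value (len(s1), len(s2), 0).
assert blocks[-1] == (len(s1), len(s2), 0)
```

The function above charges `max(gap_in_s1, gap_in_s2)` for the unmatched stretch between two consecutive blocks, so its result depends entirely on which blocks SequenceMatcher picks.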

I can't come up with a test case where it fails, and the performance seems to be significantly better than the standard algorithm.

Here are the timing results for the Levenshtein implementation that relies on difflib:

>>> from timeit import Timer
>>> setup = """
... from difflib import SequenceMatcher as sm
... 
... def lev_using_difflib(s1, s2):
...     a = b = size = distance = 0
...     for m in sm(a=s1, b=s2).get_matching_blocks():
...         distance += max(m.a-a, m.b-b) - size
...         a, b, size = m
...     return distance
... 
... strings = [('sunday','saturday'),
...            ('fitting','babysitting'),
...            ('rosettacode','raisethysword')]
... """
>>> stmt = """
... for s in strings:
...     lev_using_difflib(*s)
... """
>>> Timer(stmt, setup).timeit(100000)
36.989389181137085

And here's the standard pure Python implementation:

>>> from timeit import Timer
>>> setup2 = """
... def levenshtein(a,b):
...     n, m = len(a), len(b)
...     if n > m:
...         a,b = b,a
...         n,m = m,n
... 
...     current = range(n+1)
...     for i in range(1,m+1):
...         previous, current = current, [i]+[0]*n
...         for j in range(1,n+1):
...             add, delete = previous[j]+1, current[j-1]+1
...             change = previous[j-1]
...             if a[j-1] != b[i-1]:
...                 change = change + 1
...             current[j] = min(add, delete, change)
... 
...     return current[n]
... 
... strings = [('sunday','saturday'),
...            ('fitting','babysitting'),
...            ('rosettacode','raisethysword')]
... """
>>> stmt2 = """
... for s in strings:
...     levenshtein(*s)
... """
>>> Timer(stmt2, setup2).timeit(100000)
55.594768047332764

Is the performance of the algorithm using difflib's SequenceMatcher really better? Or is it relying on a C library that invalidates the comparison completely? If it is relying on C extensions, how can I tell by looking at the difflib.py implementation?

Using Python 2.7.3 [GCC 4.2.1 (Apple Inc. build 5666)]

Thanks in advance for your help!
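
On the C-extension question, one way to check is to ask the `inspect` module where an object's source lives. The sketch below assumes Python 3 and a standard CPython install; `has_python_source` is a hypothetical helper name, not part of any library. Objects defined in a `.py` file report that file, while built-ins implemented in C have no Python source to report.

```python
import difflib
import inspect


def has_python_source(obj):
    """Hypothetical helper: True if obj is defined in a .py source file."""
    try:
        path = inspect.getsourcefile(obj)
    except TypeError:
        # inspect raises TypeError for built-in objects, which are implemented in C.
        return False
    return path is not None and path.endswith(".py")


print(difflib.__file__)                            # location of the module on disk
print(has_python_source(difflib.SequenceMatcher))  # class defined in difflib.py
print(has_python_source(len))                      # built-in function
```

If `SequenceMatcher` reports a `.py` source file, its matching logic is ordinary Python, and any C-accelerated helper it relied on would have to appear as an import at the top of `difflib.py`.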

>>> levenshtein('hello', 'world')
4
>>> lev_using_difflib('hello', 'world')
5
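
The discrepancy in the session above can be reproduced with a small comparison harness. This is a sketch under a few assumptions: it targets Python 3, it restates both functions from the question so the block is self-contained, and `find_mismatches` is a hypothetical helper name. The difflib documentation notes that SequenceMatcher does not yield minimal edit sequences, so a distance built from its matching blocks can exceed the true Levenshtein distance.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Reference dynamic-programming Levenshtein distance."""
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n, m)) space.
        a, b = b, a
        n, m = m, n
    current = list(range(n + 1))
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1] + (a[j - 1] != b[i - 1])
            current[j] = min(add, delete, change)
    return current[n]


def lev_using_difflib(s1, s2):
    """The difflib-based estimate from the question."""
    a = b = size = distance = 0
    for m in SequenceMatcher(a=s1, b=s2).get_matching_blocks():
        distance += max(m.a - a, m.b - b) - size
        a, b, size = m
    return distance


def find_mismatches(pairs):
    """Return (s1, s2, reference, estimate) for every pair where the two disagree."""
    mismatches = []
    for s1, s2 in pairs:
        reference, estimate = levenshtein(s1, s2), lev_using_difflib(s1, s2)
        if reference != estimate:
            mismatches.append((s1, s2, reference, estimate))
    return mismatches


print(find_mismatches([("hello", "world")]))  # [('hello', 'world', 4, 5)]
```

For 'hello' and 'world', SequenceMatcher anchors on a single matching 'l', which leaves 'he' against 'wor' (cost 3) and 'lo' against 'd' (cost 2). Aligning the strings position by position needs only four substitutions.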
