Implementing Levenshtein distance in Python

I have implemented the algorithm, but now I want to find the edit distance for the string that has the shortest edit distance to the other strings.

Here is the algorithm:

def lev(s1, s2):
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)

Your "implementation" has several flaws:

(1) It should start with def lev(a, b): rather than def lev(s1, s2): because the body refers to a and b. Please get into the good habit of (a) running your code before asking questions about it and (b) quoting the code that you actually ran (by copy/paste, not by error-prone retyping).

(2) It has no termination conditions. For any arguments it will eventually end up trying to evaluate lev("", ""), which would recurse forever were it not for Python's recursion limit: RuntimeError: maximum recursion depth exceeded (a RecursionError in Python 3.5 and later).

You need to insert two lines:

if not a: return len(b)
if not b: return len(a)

to make it work.

(3) The Levenshtein distance is defined recursively. There is no such thing as "the" (one and only) algorithm. Recursive code is rarely seen outside a classroom and then only in a "strawman" capacity.

(4) Naive table-based implementations take time and memory proportional to len(a) * len(b). Aren't those strings normally a little bit longer than 4 to 8 characters?

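(5) Your extremely naive implementation is worse, because it copies slices of its inputs and, with no memoization, recomputes the same subproblems over and over, so its running time grows exponentially with the string lengths.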
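
Putting points (1) and (2) together, a minimal sketch of the repaired recursive version might look like this. It is still the slow classroom formulation criticized in (3) to (5), so it is only reasonable for very short strings.

```python
def lev(a, b):
    # Termination conditions: if one string is empty, the distance
    # is simply the length of the other one.
    if not a:
        return len(b)
    if not b:
        return len(a)
    return min(
        lev(a[1:], b[1:]) + (a[0] != b[0]),  # match or substitute the first characters
        lev(a[1:], b) + 1,                   # delete the first character of a
        lev(a, b[1:]) + 1,                   # insert the first character of b
    )


print(lev("kitten", "sitting"))  # 3
```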

You can find working, not-very-naive implementations on the web (search for "levenshtein python"). Look for ones which use O(max(len(a), len(b))) additional memory.
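
As an illustration of the kind of implementation meant here, this is a sketch of the common two-row dynamic-programming approach, not any particular library's code. The name lev_iter is made up for this example. It keeps only the previous and the current row of the table, so the extra memory is proportional to the shorter string, which is within the bound mentioned above.

```python
def lev_iter(a, b):
    # Keep the shorter string in b so each row has min(len(a), len(b)) + 1 entries.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # match or substitute
            ))
        previous = current
    return previous[-1]


print(lev_iter("kitten", "sitting"))  # 3
```

Unlike the recursive version, this does no slicing and visits each cell of the table exactly once.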

What you asked for ("the edit distance for the string that has the shortest edit distance to the other strings") doesn't make sense. "THE string"? It takes two to tango :-)

What you probably want (finding all pairs of strings in a collection which have the minimal distance), or maybe just that minimal distance, is a simple programming exercise. What have you tried?

By the way, finding those pairs by a simplistic algorithm will take O(N ** 2) executions of lev(), one per pair, where N is the number of strings in the collection. If this is a real-world application, you should look to use proven code rather than try to write it yourself. If this is homework, you should say so.
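
A sketch of that simplistic all-pairs search, assuming the goal is the minimal distance plus every pair that achieves it. The name closest_pairs and its distance parameter are hypothetical. The small memoized lev() is included only to keep the example self-contained, and any correct distance function can be passed in instead.

```python
import itertools
from functools import lru_cache


def lev(a, b):
    # Memoized recursion over indexes; fine for short strings only,
    # since long inputs would still hit Python's recursion limit.
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == len(a):
            return len(b) - j
        if j == len(b):
            return len(a) - i
        return min(
            d(i + 1, j + 1) + (a[i] != b[j]),  # match or substitute
            d(i + 1, j) + 1,                   # delete a[i]
            d(i, j + 1) + 1,                   # insert b[j]
        )
    return d(0, 0)


def closest_pairs(strings, distance=lev):
    # One distance() call per unordered pair: N * (N - 1) / 2 calls in total.
    best = None
    pairs = []
    for s, t in itertools.combinations(strings, 2):
        dist = distance(s, t)
        if best is None or dist < best:
            best, pairs = dist, [(s, t)]
        elif dist == best:
            pairs.append((s, t))
    return best, pairs


print(closest_pairs(['AATC', 'TAGCGATC', 'ATCGAT']))
```

With fewer than two strings there are no pairs, so the function returns (None, []).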

Is this what you're looking for?

import itertools
import collections

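# My simple distance function: it counts the positions at which the two
# strings differ, padding the shorter one with '-'. Note that this is not a
# true Levenshtein distance, because it never shifts characters
# (e.g. 'ABC' vs 'BC' gives 3 here, while the edit distance is 1).
def levenshtein_distance(string1, string2):
    """
    >>> levenshtein_distance('AATZ', 'AAAZ')
    1
    >>> levenshtein_distance('AATZZZ', 'AAAZ')
    3
    """

    distance = 0

    if len(string1) < len(string2):
        string1, string2 = string2, string1

    # zip_longest is the Python 3 name for Python 2's izip_longest.
    for i, v in itertools.zip_longest(string1, string2, fillvalue='-'):
        if i != v:
            distance += 1
    return distance

# Find the string with the smallest total distance to all the others.
list_of_string = ['AATC', 'TAGCGATC', 'ATCGAT']

strings_distances = collections.defaultdict(int)

for strings in itertools.combinations(list_of_string, 2):
    # Compute each pair's distance once and credit it to both strings.
    pair_distance = levenshtein_distance(*strings)
    strings_distances[strings[0]] += pair_distance
    strings_distances[strings[1]] += pair_distance

# (string, total distance) tuple; items() replaces Python 2's iteritems().
shortest = min(strings_distances.items(), key=lambda x: x[1])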
