简体   繁体   中英

Levenshtein distance in python giving only 1 as edit distance

I have a python program to read two lists (one with errors and other with correct data). Every element in my list with error needs to be compared with every element in my correct list. After comparing i get all the edit distance between every compared pair. now i can find the least edit distance for the given error data and thru get my correct data.

I am trying to use levenshtein distance to calculate the edit distance but its returning all edit distance as 1 even which is wrong.

This means the code for calculating levenshtein distance is not correct. I am struggling to find a fix for this. HELP!

My Code

import csv

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)

if __name__ == "__main__":

    with open("all_correct_promo.csv","rb") as file1:
        reader1 = csv.reader(file1)
        correctPromoList = list(reader1)
        #print correctPromoList

    with open("all_extracted_promo.csv","rb") as file2:
        reader2 = csv.reader(file2)
        extractedPromoList = list(reader2)
        #print extractedPromoList

    incorrectPromo = []
    count = 0
    for extracted in extractedPromoList:
        if(extracted not in correctPromoList):
            incorrectPromo.append(extracted)
        else:
            count = count + 1
    #print incorrectPromo

    for promos in incorrectPromo:
        for correctPromo in correctPromoList:
            distance = lev(promos,correctPromo)
            print promos, correctPromo , distance

The implementation is correct. I tested this:

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)


print lev('abcde','bc') # prints 3, which is correct
print lev('abc','bc') # prints 1, which is correct

Your problem, as I noticed by your comments, is probably when you call the method:

a = ['NSP-212690']
b = ['FE SV X']

print lev(a,b) # prints 1 which is incorrect because you are comparing arrays, not strings
print lev(a[0],b[0]) # prints 10, which is correct

So, what you can do is:

Before the call to "lev(a,b)", extract the first element of each array

def lev(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:])+(a[0] != b[0]), lev(a[1:], b)+1, lev(a, b[1:])+1)

a = ['NSP-212690']
b = ['FE SV X']
a = a[0] # this is the key part
b = b[0] # and this
print lev(a,b) # prints 10, which is correct

Anyway, I would not recommend you that recursive implementation, because the performance is veeeery poor

I would recommend this implementation instead (source: wikipedia-levenshtein )

def lev(seq1, seq2):
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
    return thisrow[len(seq2) - 1]

or maybe this slightly modified version:

def lev(seq1, seq2):
    if not a: return len(b)
    if not b: return len(a)
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
    return thisrow[len(seq2) - 1]

There is a Python package available that implements the levenshtein distance : python-levenshtein

To install it:

pip install python-levenshtein

To use it:

>>> import Levenshtein
>>> string1 = 'dsfjksdjs'
>>> string2 = 'dsfiksjsd'
>>> print Levenshtein.distance(string1, string2)
3

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM