I have a list of strings and I want to filter out the strings that are too similar, based on Levenshtein distance. So if lev(list[0], list[10]) < 50, then del list[10]. Is there any way I can calculate such a distance between every pair of strings in the list more efficiently? Thanks!!
data2 = []
for i in data:
    for index, j in enumerate(data):
        s = levenshtein(i, j)
        if s < 50:
            del data[index]
    data2.append(i)
The rather dumb code above is taking too long to compute...
What if we kept only the indexes of the hit-strings and just skipped them later? I don't know how much enumerate() and del cost, or what the percentage of hits is (i.e. how many strings must be removed from your dataset).
THRESHOLD = 50
data = ["hel", "how", "are", "you"]  # replace with your dataset

tbr = {}  # holds the indexes of the strings to be removed
idx = 0
for i in data:
    for j in range(len(data)):  # use xrange on Python 2
        if j != idx and levenshtein(i, data[j]) < THRESHOLD:
            tbr[j] = True
    idx += 1
# print(tbr)

data2 = []
idx = -1
for d in data:
    idx += 1
    if idx in tbr:
        continue  # skip this string
    data2.append(d)
# print(data2)
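Another way to cut the work down is to compare each string only against the strings you have already decided to keep, and to skip the expensive distance computation when the length difference alone guarantees the distance is over the threshold (the edit distance can never be smaller than the length gap). Below is a sketch of that idea; the names `dedupe` and `threshold` are mine, and the `levenshtein` function shown is a stand-in dynamic-programming implementation for whatever you are using. Note the semantics differ slightly from the code above: this keeps one representative per cluster of similar strings, whereas marking both members of a similar pair can remove the whole cluster.

```python
def levenshtein(a, b):
    # Classic two-row dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def dedupe(strings, threshold=50):
    """Keep a string only if it is at least `threshold` edits away
    from every string already kept."""
    kept = []
    for s in strings:
        for k in kept:
            # Cheap lower bound: distance >= difference in lengths,
            # so only call levenshtein() when the gap is small enough.
            if abs(len(s) - len(k)) < threshold and levenshtein(s, k) < threshold:
                break  # too similar to an already-kept string
        else:
            kept.append(s)
    return kept
```

With your data this is still O(n * len(kept)) distance calls in the worst case, but when many strings are removed, `kept` stays short and the length shortcut avoids most of the remaining calls.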