I have a list of strings and I want to filter out the strings that are too similar, based on Levenshtein distance. So if lev(list[0], list[10]) < 50, then del list[10]. Is there any way I can calculate such a distance between every pair of strings in the list more efficiently? Thanks!!
data2 = []
for i in data:
    for index, j in enumerate(data):
        s = levenshtein(i, j)
        if s < 50:
            del data[index]
    data2.append(i)
The rather dumb code above is taking too long to compute...
What if we kept only the indexes of the hit-strings and just skipped them later? I don't know how much enumerate() and del cost, or what the percentage of hits is (i.e. how many strings must be removed from your dataset).
THRESHOLD = 50
data = ["hel", "how", "are", "you"]  # replace with your dataset

tbr = {}  # holds the indexes of the strings to be removed
idx = 0
for i in data:
    for j in range(len(data)):  # use xrange on Python 2
        if j != idx and levenshtein(i, data[j]) < THRESHOLD:
            tbr[j] = True
    idx += 1
# print(tbr)

data2 = []
idx = -1
for d in data:
    idx += 1
    if idx in tbr:
        continue  # skip this string
    data2.append(d)
# print(data2)
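Another way to cut the work down is to compare each string only against the strings you have already decided to keep, and to skip the expensive distance computation when the length difference alone guarantees the distance is over the threshold (the edit distance can never be smaller than the length gap). Below is a sketch of that idea; the names `dedupe` and `threshold` are mine, and the `levenshtein` function shown is a stand-in dynamic-programming implementation for whatever you are using. Note the semantics differ slightly from the code above: this keeps one representative per cluster of similar strings, whereas marking both members of a similar pair can remove the whole cluster.

```python
def levenshtein(a, b):
    # Classic two-row dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def dedupe(strings, threshold=50):
    """Keep a string only if it is at least `threshold` edits away
    from every string already kept."""
    kept = []
    for s in strings:
        for k in kept:
            # Cheap lower bound: distance >= difference in lengths,
            # so only call levenshtein() when the gap is small enough.
            if abs(len(s) - len(k)) < threshold and levenshtein(s, k) < threshold:
                break  # too similar to an already-kept string
        else:
            kept.append(s)
    return kept
```

With your data this is still O(n * len(kept)) distance calls in the worst case, but when many strings are removed, `kept` stays short and the length shortcut avoids most of the remaining calls.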