I wrote a function to calculate the Levenshtein distance (L.distance) between two variables in Python using the Levenshtein package. However, I get a TypeError ("distance expected two Strings or two Unicodes") when I apply the function, even though both variables I pass to the L.distance calculation contain strings.
I first tried a for loop, then took it out after looking at other scripts online that implement the L.distance. I also created a test dataframe that compares only single words against each other, since I thought multi-word values could be the issue (my real data compares company names, which often contain several words).
lst = ['bear', 'tomato', 'green', 'snake']
lst2 = ['baear', 'tomato', 'grean', 'snake']
dftest = pd.DataFrame(list(zip(lst, lst2)), columns=['lst1', 'lst2'])
result = []
def distancefinder(string1, string2):
    for string1, string2 in something:
        stringdist = lv.distance(string1, string2)
        result.append(stringdist)
    return (result)
dftest['lv_matchscore'] = distancefinder(dftest.lst1, dftest.lst2)
The expected output is the calculated L.distance of the two variables.
Here's the way you should do it:
# Imports
import pandas as pd
import Levenshtein as lv

lst = ['bear', 'tomato', 'green', 'snake']
lst2 = ['baear', 'tomato', 'grean', 'snake']
dftest = pd.DataFrame(list(zip(lst, lst2)), columns=['lst1', 'lst2'])

def distancefinder(lst1, lst2):
    # Create the list you will populate with the results
    results = []
    # Loop through your records (Levenshtein uses strings, not pandas.Series)
    for i in range(len(lst1)):
        # Calculate the distance between the i-th string of each series
        stringdist = lv.distance(lst1[i], lst2[i])
        # Append the result
        results.append(stringdist)
    # Return the results list
    return results

dftest['lv_matchscore'] = distancefinder(dftest.lst1, dftest.lst2)
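The same pairing can also be written without index arithmetic by iterating over both columns together with zip. The sketch below is a minimal illustration under some assumptions: pairwise_distances and edit_distance are hypothetical names, and edit_distance is a pure-Python stand-in used only so the example runs without the Levenshtein package installed. With the package available, lv.distance could be passed as the distance argument instead.

```python
def edit_distance(a, b):
    """Pure-Python Levenshtein distance (illustrative stand-in for lv.distance)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def pairwise_distances(left, right, distance=edit_distance):
    """Apply a string-to-string distance to each aligned pair of values."""
    # zip yields one (string, string) pair per row, so the distance
    # function never receives a whole column at once.
    return [distance(a, b) for a, b in zip(left, right)]


lst = ['bear', 'tomato', 'green', 'snake']
lst2 = ['baear', 'tomato', 'grean', 'snake']
scores = pairwise_distances(lst, lst2)
print(scores)  # [1, 0, 1, 0]
```

Assuming the dataframe from the answer, the call would look like pairwise_distances(dftest.lst1, dftest.lst2, lv.distance), since zip iterates over the values of each series in order.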
EDIT
About the line for i in range(len(lst1)):
lst1 is the pandas.Series you want to compare (lst2 is the other one).
len(lst1) returns the length of the series as an integer (in this example, it evaluates to 4).
range(len(lst1)), which is range(4) in this case, produces the integers from 0 up to 3: 0, 1, 2, 3.
So for i in range(len(lst1)) behaves like for i in [0, 1, 2, 3] in this case.
i is used as the index to get each element from the series you want to compare. In the first iteration you compare lst1[0] and lst2[0]; in the second, lst1[1] and lst2[1]; and so on.
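To make the iteration concrete, here is a small sketch that prints what each step of the loop sees. It uses plain Python lists rather than pandas.Series (an assumption made so it runs without pandas); the names indices and pairs are only for illustration.

```python
lst1 = ['bear', 'tomato', 'green', 'snake']
lst2 = ['baear', 'tomato', 'grean', 'snake']

# range(len(lst1)) is range(4); list() makes the generated integers visible
indices = list(range(len(lst1)))
print(indices)  # [0, 1, 2, 3]

# Each index i selects one element from each list, giving a pair of strings
pairs = [(lst1[i], lst2[i]) for i in range(len(lst1))]
for i, (a, b) in enumerate(pairs):
    print(i, a, b)
```

Each printed row is one pair of plain strings, which is what the distance function expects to receive.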