
Function for Levenshtein distance calculations

I wrote a function to calculate the Levenshtein distance (L.distance) between two variables in Python using the Levenshtein package. When I apply the function, I get a TypeError ("distance expected two Strings or two Unicodes"), even though both variables I'm passing in are strings.

I first tried a for loop, then removed it after looking at other scripts online that implement the L.distance. I also created a test dataframe that compares only single words against each other, since I thought multi-word values might be the issue (my real data is company names, which often contain several words).

lst=['bear', 'tomato', 'green', 'snake']
lst2 =['baear', 'tomato', 'grean', 'snake']
dftest=pd.DataFrame(list(zip(lst,lst2)), columns =['lst1', 'lst2'])

result= []
def distancefinder(string1, string2):
    for string1, string2 in something:
        stringdist = lv.distance(string1, string2)
        result.append(stringdist)
    return (result)
dftest['lv_matchscore'] = distancefinder(dftest.lst1, dftest.lst2)

The expected output is the calculated L.distance of the two variables.

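lv.distance expects two strings, but you are passing it two whole pandas.Series. Call it once per pair of elements instead. Here's the way you should do it: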

# Imports
import pandas as pd
import Levenshtein as lv

lst = ['bear', 'tomato', 'green', 'snake']
lst2 = ['baear', 'tomato', 'grean', 'snake']
dftest = pd.DataFrame(list(zip(lst, lst2)), columns=['lst1', 'lst2'])

def distancefinder(lst1, lst2):
    # Create the list you will populate with the results
    results = []
    # Loop through your records (lv.distance takes two strings, not two pandas.Series)
    for i in range(len(lst1)):
        # Calculate the distance between the i-th pair of strings
        # (lst1[i] looks up by index label, so this assumes the default 0..n-1 index)
        stringdist = lv.distance(lst1[i], lst2[i])
        # Append the result
        results.append(stringdist)
    # Return the results list
    return results

dftest['lv_matchscore'] = distancefinder(dftest.lst1, dftest.lst2)
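
As a rough illustration of the same element-by-element idea without index lookups, here is a minimal sketch that pairs the values with zip. To keep it runnable without third-party packages, it uses a small pure-Python edit-distance function as a stand-in for lv.distance. The names levenshtein and pairwise_distances are hypothetical and not part of the Levenshtein package.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if not isinstance(a, str) or not isinstance(b, str):
        # Mirrors the spirit of the error in the question: only strings are accepted
        raise TypeError("levenshtein expects two strings")
    # previous[j] is the distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def pairwise_distances(left, right):
    """Return one distance per (left, right) pair, in order."""
    return [levenshtein(x, y) for x, y in zip(left, right)]


lst = ['bear', 'tomato', 'green', 'snake']
lst2 = ['baear', 'tomato', 'grean', 'snake']
scores = pairwise_distances(lst, lst2)
print(scores)  # [1, 0, 1, 0]
```

Iterating a pandas.Series yields its values, so the same zip pattern should also work with dftest.lst1 and dftest.lst2, and with lv.distance in place of the stand-in function.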

EDIT

for i in range(len(lst1)):

  • lst1 is the pandas.Series you want to compare (lst2 is the other one)
  • len(lst1) returns the length of the series as an integer value (in this example, it evaluates to 4)
  • range(len(lst1)) (which would be range(4) in this case) produces the integers from 0 to 3. In Python 3 this is a range object rather than a list, but iterating over it yields 0, 1, 2, 3
  • for i in range(len(lst1)) therefore behaves like for i in [0, 1, 2, 3] in this case. i is used as the index to get each element from the series you want to compare. In the first iteration, you will be comparing lst1[0] and lst2[0]; in the second, lst1[1] and lst2[1]; and so on.
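
The points above can be checked with a short sketch. It uses plain Python lists in place of the two series (an assumption made so that it runs without pandas) and shows that index-based iteration visits the same pairs as zip.

```python
lst1 = ['bear', 'tomato', 'green', 'snake']
lst2 = ['baear', 'tomato', 'grean', 'snake']

indices = range(len(lst1))
print(indices)        # range(0, 4)
print(list(indices))  # [0, 1, 2, 3]

# Index-based pairing, as in the loop explained above
pairs_by_index = [(lst1[i], lst2[i]) for i in indices]

# zip pairs the elements directly, without an explicit index
pairs_by_zip = list(zip(lst1, lst2))

print(pairs_by_index[0])               # ('bear', 'baear')
print(pairs_by_index == pairs_by_zip)  # True
```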


 