
Better Approach than FuzzyWuzzy?

I'm getting results from fuzzywuzzy that aren't working as well as I hoped. If one string has an extra word in the middle, the Levenshtein distance grows, so the score is lower.

Example:

from fuzzywuzzy import fuzz

score = fuzz.ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)

score = fuzz.partial_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.partial_ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)

Results: 81, 85, 71, 81

I'm looking for the first pair (Daniel vs. Daniel William) to be the better match than the second pair (Daniel vs. David).

Is there a better approach than fuzzywuzzy to use here?

For your example, you could use token_set_ratio. Its docstring says it takes the ratio of the intersection of the tokens against the remaining tokens.

from fuzzywuzzy import fuzz

score = fuzz.token_set_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.token_set_ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)

Result:

100
85
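
To see why the first pair now scores 100: every token of 'DANIEL CARTWRIGHT' also appears in 'DANIEL WILLIAM CARTWRIGHT', so the token intersection equals the shorter name's full token set and compares perfectly against itself. A minimal sketch of that set logic in plain Python (this is an illustration, not fuzzywuzzy's internal code):

```python
# token_set_ratio tokenizes both strings and works with the set
# intersection. Here the shorter name's tokens are a subset of the
# longer name's tokens, which is what drives the 100 score.
tokens_a = set('DANIEL CARTWRIGHT'.split())
tokens_b = set('DANIEL WILLIAM CARTWRIGHT'.split())

print(tokens_a & tokens_b)   # {'DANIEL', 'CARTWRIGHT'} (order may vary)
print(tokens_a <= tokens_b)  # True: intersection equals tokens_a
```

For 'DAVID CARTWRIGHT', only one token is shared, so no such perfect sub-match exists and the score stays at 85.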

I had a similar challenge in using FuzzyWuzzy to compare one list of names to another list of names to identify matches between the lists. The FuzzyWuzzy token_set_ratio scorer didn't work for me because, to use your example, comparing "DANIEL CARTWRIGHT" to "DANIEL WILLIAM CARTWRIGHT" and "DANIEL WILLIAM CARTWRIGHT" to "DANIEL WILLIAM CARTWRIGHT" (partial match of 2 of 3 words vs. identity match of 3 of 3 words) both yield a 100% score. For me, a match of 3 words needed to score higher than a match of 2 of 3.

I ended up using nltk in a Bag-of-Words-like approach. The algorithm in the code below converts multi-word names to lists of distinct words (tokens), counts the words of one list that appear in the other, and normalizes that count by the length of each list. Because True == 1 and False == 0 in Python, a sum() over membership tests works nicely to count the elements of one list that occur in another list.
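
The counting trick in isolation, using the example names (the variable names here are just for illustration):

```python
# True counts as 1 and False as 0, so summing membership tests
# counts how many words from one list appear in the other.
words1 = ['DANIEL', 'CARTWRIGHT']
words2 = ['DANIEL', 'WILLIAM', 'CARTWRIGHT']

common = sum(w in words1 for w in words2)
print(common)  # 2
```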

An identity match of all words scores 1 (100%). Scoring for your comparisons works out as follows:

  • DANIEL CARTWRIGHT vs. DANIEL WILLIAM CARTWRIGHT = (2/2 + 2/3)/2 = (5/3)/2 = 0.83
  • DANIEL CARTWRIGHT vs. DAVID CARTWRIGHT = (1/2 + 1/2)/2 = 1/2 = 0.5
Note that my method ignores word order, which wasn't needed in my case.

import nltk

s1 = 'DANIEL CARTWRIGHT'
s2 = ['DANIEL WILLIAM CARTWRIGHT', 'DAVID CARTWRIGHT']

def myScore(lst1, lst2):
    # calculate score for comparing lists of words
    c = sum(el in lst1 for el in lst2)
    if len(lst1) == 0 or len(lst2) == 0:
        retval = 0.0
    else:
        retval = 0.5 * (c / len(lst1) + c / len(lst2))
    return retval

tokens1 = nltk.word_tokenize(s1)
for s in s2:
    tokens2 = nltk.word_tokenize(s)
    score = myScore(tokens1, tokens2)
    print(' vs. '.join([s1, s]), ":", str(score))

Output:

DANIEL CARTWRIGHT vs. DANIEL WILLIAM CARTWRIGHT : 0.8333333333333333
DANIEL CARTWRIGHT vs. DAVID CARTWRIGHT : 0.5