
Fuzzy matching not accurate enough with TF-IDF and cosine similarity

I want to find similarities in a long list of strings. That is, for every string in the list, I need all similar strings from the same list. Earlier I used Fuzzywuzzy, which gave the accuracy I wanted using fuzz.partial_token_sort_ratio. The only problem was the time it took, since the list contains ~50k entries of strings up to 40 characters long: it went up to 36 hours for the 50k strings.

To improve the time I tried the rapidfuzz library, which reduced it to around 12 hours while giving the same output as Fuzzywuzzy, inspired by an answer here. Later I tried TF-IDF with cosine similarity, which gave some fantastic time improvements using the string-grouper library, inspired by this blog. Investigating the results closely, the string-grouper method missed matches like 'DARTH VADER' and 'VADER', which were caught by fuzzywuzzy and rapidfuzz. This is understandable given the way TF-IDF works; it seems to miss short strings altogether. Is there any workaround to improve the matching of string-grouper in this example, or to improve the time taken by rapidfuzz? Any faster iteration methods? Or any other way to make the problem work?
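
To illustrate the 'DARTH VADER' / 'VADER' case, here is a minimal sketch of the two scores side by side (it uses scikit-learn's character n-grams as a rough stand-in for what string-grouper does internally, so the exact cosine value is only indicative):

from rapidfuzz import fuzz
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a, b = 'DARTH VADER', 'VADER'

# 'VADER' is a complete token of 'DARTH VADER', so the token-set scorer returns 100
print(fuzz.partial_token_set_ratio(a, b))

# With character 3-gram TF-IDF only a fraction of the n-grams of 'DARTH VADER' overlap
# with 'VADER', so the cosine similarity lands well below string-grouper's default 0.8 cutoff
vecs = TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).fit_transform([a, b])
print(cosine_similarity(vecs[0], vecs[1])[0, 0])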

The data is preprocessed and contains all strings in CAPS without special characters or numbers.

Time taken per iteration is ~1s. Here is the code for rapidfuzz:

from rapidfuzz import process, utils, fuzz

results = []
for index, row in df.iterrows():
    # For each name, keep every name in the column scoring >= 80 with the partial token set scorer
    results.append(process.extract(row['names'], df['names'], scorer=fuzz.partial_token_set_ratio, score_cutoff=80))

Super fast solution; here is the code for string-grouper:

from string_grouper import match_strings
matches = match_strings(df['names'])

Some similar problems with fuzzywuzzy are discussed here: (Fuzzy string matching in Python)

Also, in general, are there any other programming languages I could switch to, like R, that might speed this up? Just curious... Thanks for your help 😊

You should give tfidf-matcher a try. It didn't work for my particular use case, but it might be a great fit for yours.

You can change the minimum similarity with min_similarity and the n-gram size with ngram_size in string-grouper's match_strings function. For this specific example you could use a higher ngram_size, but that might cause you to miss other hits again.
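
For example, something like the following (a minimal sketch; the parameter values are only illustrative, not a recommendation):

from string_grouper import match_strings

# Lower the cosine-similarity cutoff (the default is 0.8) and/or change the character n-gram size
matches = match_strings(df['names'], min_similarity=0.6, ngram_size=4)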

tfidf matcher worked wonderfully for me. No hassle, just one function to call, plus you can set how many n-grams you'd like to split the word into, the number of close matches you'd like, and a confidence value for the match. It's also fast enough: looking up a string in a dataset of around 230k words took around 3 seconds at most.
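
A minimal sketch of the call (assuming the tfidf_matcher package and these parameter names; adjust to whatever the version you install exposes):

import tfidf_matcher as tm

# Match each name against the lookup list using character n-grams,
# keeping the k closest candidates together with a confidence score
result = tm.matcher(original=df['names'], lookup=df['names'], k_matches=5, ngram_length=3)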
