
Optimize element-wise fuzzy match between two lists

I have two lists of companies (> 2k entries in the longer list) in different formats that I need to unify. I know that both formats share a stub about 80% of the time, so I'm using fuzzy match to compare both lists:

def get_fuzz_score(str1, str2):

    from fuzzywuzzy import fuzz
    partial_ratio = fuzz.partial_ratio(str1, str2)
    return partial_ratio


a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

for i in b:
    for j in a:
        if get_fuzz_score(i, j) > 80:
            pass  # process

I'd appreciate thoughts on how to optimize this task for performance (e.g., without having to use two nested for loops).

First, I would move the import (from fuzzywuzzy import fuzz) out of the function and to the top of the file.

Next, it appears that you want to check every element against every other, so you are doing an all-to-all comparison anyway, and I don't see a simple way around that.

If the data are 'nice', then you could apply a simple heuristic, e.g. only comparing names that share a first letter (this would work for the examples you've posted, but it depends on the data).
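The first-letter heuristic could be sketched roughly as follows. This is only an illustration under stated assumptions: match_by_first_letter and stand_in_score are hypothetical names, and stand_in_score is a crude standard-library substitute so the snippet runs without fuzzywuzzy installed. In real use you would pass fuzz.partial_ratio as the scorer.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def stand_in_score(s1, s2):
    """Crude stand-in scorer (0-100), NOT fuzzywuzzy: 100 if the shorter
    string is contained in the longer one, else a difflib similarity."""
    s1, s2 = s1.lower(), s2.lower()
    shorter, longer = sorted((s1, s2), key=len)
    if shorter and shorter in longer:
        return 100.0
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


def match_by_first_letter(a, b, scorer=stand_in_score, threshold=80):
    """Only compare names that start with the same (case-insensitive) letter."""
    buckets = defaultdict(list)
    for name in a:
        buckets[name[:1].lower()].append(name)

    matches = []
    for name in b:
        # Look only at the bucket for this name's first letter.
        for candidate in buckets.get(name[:1].lower(), []):
            if scorer(name, candidate) > threshold:
                matches.append((name, candidate))
    return matches


a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

matches = match_by_first_letter(a, b)
for pair in matches:
    print(pair)
```

Note the trade-off: pairs whose first letters differ (say, a name with a leading "The") are never compared, so this only helps if the shared stub reliably sits at the start of both names.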

Best regards

P.S. I would have posted this as a comment if my score were high enough.

I assume you installed both fuzzywuzzy and python-Levenshtein. In my case the installation of the second package failed, so I got this warning:

warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
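If you are unsure which situation you are in, a quick check along these lines may help. It is a sketch that assumes python-Levenshtein is importable under the module name Levenshtein; has_fast_backend is just an illustrative variable name.

```python
from importlib.util import find_spec

# find_spec returns None when a top-level module cannot be found,
# so this does not actually import anything.
has_fast_backend = find_spec("Levenshtein") is not None

if has_fast_backend:
    print("python-Levenshtein found; fuzzywuzzy should use the faster matcher.")
else:
    print("python-Levenshtein not found; expect the slow pure-python SequenceMatcher warning.")
```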

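You can use itertools.product to create the Cartesian product. This replaces the two nested loops with a single one, although it still performs len(a) * len(b) comparisons: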

from itertools import product
from fuzzywuzzy import fuzz

def get_fuzz_score(str1, str2):
    partial_ratio = fuzz.partial_ratio(str1, str2)
    return partial_ratio


a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

for first, second in product(a, b):
    if get_fuzz_score(first, second) > 80:
        pass  # process

If your function get_fuzz_score is not going to do anything more than this, you can drop it and call fuzz.partial_ratio directly:

from itertools import product
from fuzzywuzzy import fuzz

a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

for first, second in product(a, b):
    if fuzz.partial_ratio(first, second) > 80:
        pass  # process

fuzzywuzzy provides a process.extract* family of functions to help with this, e.g.:

from fuzzywuzzy import process

a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

for name in a:
    print(name, process.extract(name, b, limit=3))

This will print out each name in a along with its top three matches from b.

This is still O(n**2), but because the library is open source, you can read how extract is defined and perhaps do the preprocessing once rather than on every call, which would hopefully speed things up a lot.
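To illustrate the "preprocess once" idea without relying on fuzzywuzzy internals, here is a minimal standard-library sketch. The normalize and stand_in_score helpers are hypothetical: normalize is one possible cleanup (lowercase, punctuation to spaces), and stand_in_score is a rough substitute for a fuzzy scorer, not the library's own algorithm.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Hypothetical cleanup: lowercase, punctuation to spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", name.lower()).split())


def stand_in_score(s1, s2):
    """Crude stand-in scorer (0-100), NOT fuzzywuzzy: 100 if the shorter
    string is contained in the longer one, else a difflib similarity."""
    shorter, longer = sorted((s1, s2), key=len)
    if shorter and shorter in longer:
        return 100.0
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


a = ['Express Scripts', 'Catamaran Corp', 'Banmedica SA (96.7892%)', 'WebMD', 'ODC', 'Caremerge LLC (Stake%)']
b = ['Doctor on Demand', 'Catamaran', 'Express Scripts Holding Corp', 'ODC, Inc.', 'WebMD Health Services', 'Banmedica']

# Clean each list exactly once, rather than once per comparison.
norm_a = [(name, normalize(name)) for name in a]
norm_b = [(name, normalize(name)) for name in b]

best = {}
for raw_b, clean_b in norm_b:
    # Score this name against every cleaned name from a and keep the best one.
    score, raw_a = max((stand_in_score(clean_b, clean_a), raw_a) for raw_a, clean_a in norm_a)
    if score > 80:
        best[raw_b] = raw_a

for raw_b, raw_a in best.items():
    print(raw_b, '->', raw_a)
```

The number of comparisons is unchanged, but each string is cleaned len(a) + len(b) times in total instead of on every one of the len(a) * len(b) comparisons.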
