
How to compare strings more efficiently when using fuzzywuzzy?

I have a CSV file with ~20,000 words, and I'd like to group the words by similarity. To do this, I am using the fantastic fuzzywuzzy package, which works really well and does exactly what I want on a small dataset (~100 words).

The words are actually brand names. This is a sample output from the small dataset I just mentioned, with similar brands grouped by name:

[
    ('asos-design', 'asos'), 
    ('m-and-s', 'm-and-s-collection'), 
    ('polo-ralph-lauren', 'ralph-lauren'), 
    ('hugo-boss', 'boss'), 
    ('yves-saint-laurent', 'saint-laurent')
]

My problem is that when I run my current code on the full dataset, it is really slow. I don't know how to improve the performance, or how to do this without using two for loops.

This is my code.

import csv
from fuzzywuzzy import fuzz

THRESHOLD = 90

possible_matches = []


with open('words.csv', encoding='utf-8') as csvfile:
    words = []
    reader = csv.reader(csvfile)
    for row in reader:
        word, x, y, *rest = row
        words.append(word)

    for i in range(len(words)-1):
        for j in range(i+1, len(words)): 
            if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD:
                possible_matches.append((words[i], words[j]))

        print(i)
    print(possible_matches)

How can I improve the performance?

For 20,000 words (or brands), any approach that compares each word to every other word, i.e. has quadratic complexity O(n²), may be too slow. For 20,000 words it may still be barely acceptable, but for any larger dataset it will quickly break down.
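To make the quadratic growth concrete, here is a small sketch that counts the unordered pairs a pairwise comparison has to score. The helper name pair_count is only an illustration and is not part of fuzzywuzzy.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# 100 words need 4,950 comparisons; 20,000 words need 199,990,000.
for n in (100, 20_000, 200_000):
    print(f"{n:>7} words -> {pair_count(n):>14,} comparisons")
```

Going from 100 to 20,000 words multiplies the input by 200 but the work by roughly 40,000.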

Instead, you could try to extract some "feature" from your words and group them accordingly. My first idea was to use a stemmer, but since your words are names rather than real words, this will not work. I don't know how representative your sample data is, but you could try to group the words by their components separated by "-", then keep the unique non-trivial groups (those with more than one member), and you are done.

from collections import defaultdict

words = ['asos-design', 'asos', 'm-and-s', 'm-and-s-collection', 
         'polo-ralph-lauren', 'ralph-lauren', 'hugo-boss', 'boss',
         'yves-saint-laurent', 'saint-laurent']

# map each hyphen-separated component to the words that contain it
parts = defaultdict(list)
for word in words:
    for part in word.split("-"):
        parts[part].append(word)

# keep each group of two or more words once
result = set(tuple(group) for group in parts.values() if len(group) > 1)

Result:

{('asos-design', 'asos'),
 ('hugo-boss', 'boss'),
 ('m-and-s', 'm-and-s-collection'),
 ('polo-ralph-lauren', 'ralph-lauren'),
 ('yves-saint-laurent', 'saint-laurent')}

You might also want to filter out some stop words first, like "and", or keep those together with the words around them. This will probably still yield some false positives, e.g. with words like "polo" or "collection" that may appear with several different brands, but I assume the same is true for fuzzywuzzy or similar. A bit of post-processing and manual filtering of the groups may be in order.
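The stop-word idea could be sketched as follows. The STOP_WORDS set and the function name group_by_parts are assumptions chosen for illustration, and 'polo-club' is a made-up entry added only to trigger a false positive; you would need to tune the list against your own data.

```python
from collections import defaultdict

# Hypothetical stop-word list; adjust it for your own brand names.
STOP_WORDS = frozenset({"and", "polo", "collection"})


def group_by_parts(words, stop_words=STOP_WORDS):
    """Group words that share a hyphen-separated component, ignoring stop words."""
    parts = defaultdict(list)
    for word in words:
        # set() so a component repeated inside one word is only counted once
        for part in set(word.split("-")):
            if part not in stop_words:
                parts[part].append(word)
    # keep only groups with at least two members, without duplicates
    return {tuple(group) for group in parts.values() if len(group) > 1}


sample = ['asos-design', 'asos', 'm-and-s', 'm-and-s-collection',
          'polo-ralph-lauren', 'ralph-lauren', 'hugo-boss', 'boss',
          'yves-saint-laurent', 'saint-laurent',
          'polo-club']  # made-up entry that shares only "polo"

with_filter = group_by_parts(sample)
without_filter = group_by_parts(sample, stop_words=frozenset())

# Groups that exist only because of a stop word:
print(without_filter - with_filter)
```

Without the filter, 'polo-ralph-lauren' and 'polo-club' end up in one group; with it, that group disappears while 'm-and-s' is still matched through its "m" and "s" components.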

Try using a list comprehension instead; it is usually somewhat faster than calling list.append() in a loop, although the number of comparisons stays the same:

with open('words.csv', encoding='utf-8') as csvfile:
    words = [row[0] for row in csv.reader(csvfile)]

possible_matches = [
    (words[i], words[j])
    for i in range(len(words) - 1)
    for j in range(i + 1, len(words))
    if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD
]

print(possible_matches)

Unfortunately, this way you can't call print(i) on each iteration, but assuming you only needed print(i) for debugging, this won't affect your final result.
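If you want to drop the index arithmetic and still see matches as they are found, one possible sketch uses itertools.combinations inside a generator. The names find_matches and stdlib_ratio are illustrative. The difflib-based scorer is only a stand-in so the sketch runs without third-party packages; it is not equivalent to fuzz.token_set_ratio, which you would pass as scorer instead. The number of comparisons is still quadratic.

```python
from difflib import SequenceMatcher
from itertools import combinations

THRESHOLD = 90


def stdlib_ratio(a, b):
    """Stand-in scorer on a 0-100 scale (not the same as fuzz.token_set_ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def find_matches(words, scorer=stdlib_ratio, threshold=THRESHOLD):
    """Yield every unordered pair of words whose score reaches the threshold."""
    # combinations(words, 2) produces each pair once, like the i < j loops
    for a, b in combinations(words, 2):
        if scorer(a, b) >= threshold:
            yield a, b


sample = ['asos', 'asos-design', 'boss']

# Because find_matches is a generator, each match can be printed as it arrives.
for pair in find_matches(sample, threshold=60):
    print(pair)
```

With fuzzywuzzy installed, the call would look like find_matches(words, scorer=fuzz.token_set_ratio).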

Converting a loop into a list comprehension is straightforward. Suppose you have a loop like this:

for i in iterable_1:
    lst.append(something)

The list comprehension becomes:

lst = [something for i in iterable_1]

For nested loops and conditions, just follow the same logic:

for i in iterable_1:
    for j in iterable_2:
        ...
        if some_condition:
            lst.append(something)

# becomes

lst = [something for i in iterable_1 for j in iterable_2 ... if some_condition]

# Or if you have an else clause:

for i in iterable_1:
    ...
    if some_condition:
        lst.append(something)
    else:
        lst.append(something_else)

# becomes

lst = [something if some_condition else something_else for i in iterable_1 ...]
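As a concrete check of the patterns above, here is a small runnable sketch with made-up data (numbers and letters are placeholders, not taken from the question). It builds each list both ways so the results can be compared.

```python
numbers = [1, 2, 3]
letters = ["a", "b"]

# Nested loops with a filtering condition...
pairs_loop = []
for n in numbers:
    for ch in letters:
        if n != 2:
            pairs_loop.append((n, ch))

# ...and the equivalent comprehension: the clauses keep the order of the loops.
pairs_comp = [(n, ch) for n in numbers for ch in letters if n != 2]

# An if/else in the loop body becomes a conditional expression before the `for`.
labels_loop = []
for n in numbers:
    if n % 2 == 0:
        labels_loop.append("even")
    else:
        labels_loop.append("odd")

labels_comp = ["even" if n % 2 == 0 else "odd" for n in numbers]

print(pairs_loop == pairs_comp, labels_loop == labels_comp)
```

Note that a filtering condition goes after the for clauses, while an if/else that chooses the value goes before them.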

