
How to optimize an algorithm to find similar strings with fuzzywuzzy faster?

I have a problem finding similar food names in my database (there are about 100k product names). I've decided to use fuzz.token_sort_ratio from the fuzzywuzzy library to find similar product names. This is how it works:

s1 = 'Pepsi Light'
s2 = 'Light Pepsi'
fuzz.token_sort_ratio(s1, s2)

100

Now I want to find all pairs of product names with similar words, where the result of fuzz.token_sort_ratio is >= 90. Here is my code:

# Find similar names (v_foods is the DataFrame holding the product names)
from datetime import datetime
from fuzzywuzzy import fuzz
import pandas as pd

start = datetime.now()
l = list(v_foods.name[0:20000])
i = 0
df = pd.DataFrame(columns=['name1', 'name2', 'probab_same'])
for k in range(len(l)):
    for s in range(k+1,len(l)):
        probability = fuzz.token_sort_ratio(l[k], l[s])
        if  probability >= 90:
            df.loc[i] = [l[k], l[s], probability]
            i +=1
print('Spent time: {}' .format(datetime.now() - start))           
df.head(5)   

It takes a lot of time, and the more products I have, the longer it takes:

  1. l = list(v_foods.name[0:5000]) Spent time: ~3 minutes
  2. l = list(v_foods.name[0:10000]) Spent time: ~13 minutes
  3. l = list(v_foods.name[0:20000]) Spent time: ~53 minutes

As I said above, my database has 100k names, so it will be very slow. Are there any methods to optimize my algorithm?

Your problem is that you are comparing each name to every other name. That's O(n^2) comparisons (for 100k names, roughly 5 billion pairs), so it gets slow quickly. What you need to do is only compare pairs of names that have a chance of being similar enough.

To do better, we need to know what the library is actually doing. Thanks to this excellent answer we can tell that: it calls fuzz._process_and_sort(name, True) on both names, then computes a Levenshtein ratio. That is, it finds the best way to get from one string to the other and then calculates 100 * matched_chars / (matched_chars + edits). For this score to come out at 90+, the number of edits is at most len(name) / 9. (That condition is necessary but not sufficient: if those edits include substitutions or deletions in this string, they lower the number of matched characters and therefore lower the ratio.)
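To make that bound concrete, here is a small hypothetical helper (not part of fuzzywuzzy) that computes the edit budget implied by a score cutoff, assuming matched_chars is at most len(name):

def max_edits(name, cutoff=90):
    # 100 * M / (M + E) >= cutoff and M <= len(name) imply
    # E <= len(name) * (100 - cutoff) / cutoff, i.e. len(name) / 9 for a cutoff of 90.
    return int(len(name) * (100 - cutoff) / cutoff)

print(max_edits('light pepsi'))          # 11 characters -> at most 1 edit
print(max_edits('chocolate ice cream'))  # 19 characters -> at most 2 edits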

So you can normalize all of the names quite easily. The question is: given a normalized name, can you find all the other normalized names within a maximum number of edits of it?
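As a minimal sketch of that normalization step, assuming it only needs to roughly mimic token_sort_ratio's preprocessing (lowercase, strip non-alphanumeric characters, sort the tokens); the exact behaviour of fuzz._process_and_sort may differ in details:

import re

def normalize(name):
    # Lowercase, replace runs of non-alphanumeric characters with a space,
    # then sort the tokens so that word order no longer matters.
    tokens = re.sub(r'[^0-9a-z]+', ' ', name.lower()).split()
    return ' '.join(sorted(tokens))

print(normalize('Pepsi Light'))    # 'light pepsi'
print(normalize('Light  Pepsi!'))  # 'light pepsi'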

The trick is to first put all of your normalized names into a Trie data structure. Then, for each name, we can walk the trie and the name in parallel, exploring all branches that stay within a certain edit distance. This allows big groups of normalized names that are out of that distance to be dropped without examining them individually.

Here is a Python implementation of the Trie that will let you find those pairs of normalized names.

import re

# Now we will build a trie.  Every node has a list of strings that end there, and a
# dictionary mapping the next letter to the subtrie one level farther down.
class Trie:
    def __init__(self, path=''):
        self.strings = []
        self.dict = {}
        self.count_strings = 0
        self.path = path

    def add_string (self, string):
        trie = self

        for letter in string:
            trie.count_strings += 1
            if letter not in trie.dict:
                trie.dict[letter] = Trie(trie.path + letter)
            trie = trie.dict[letter]
        trie.count_strings += 1
        trie.strings.append(string)

    def __hash__ (self):
        return id(self)

    def __repr__ (self):
        answer = self.path + ":\n  count_strings:" + str(self.count_strings) + "\n  strings: " + str(self.strings) + "\n  dict:"
        def indent (string):
            p = re.compile("^(?!:$)", re.M)
            return p.sub("    ", string)
        for letter in sorted(self.dict.keys()):
            subtrie = self.dict[letter]
            answer = answer + indent("\n" + subtrie.__repr__())
        return answer

    def within_edits(self, string, max_edits):
        # This will be all trie/string pos pairs that we have seen
        found = set()
        # This will be all trie/string pos pairs that we start the next edit with
        start_at_edit = set()

        # At distance 0 we start at the base of the trie, matching the start of the string.
        start_at_edit.add((self, 0))
        answers = []
        for edits in range(max_edits + 1): # 0..max_edits inclusive
            start_at_next_edit = set()
            todo = list(start_at_edit)
            for trie, pos in todo:
                if (trie, pos) not in found: # Have we processed this?
                    found.add((trie, pos))
                    if pos == len(string):
                        answers.extend(trie.strings) # ANSWERS FOUND HERE!!!
                        # We have to delete from the other string
                        for next_trie in trie.dict.values():
                            start_at_next_edit.add((next_trie, pos))
                    else:
                        # This string could have an insertion
                        start_at_next_edit.add((trie, pos+1))
                        for letter, next_trie in trie.dict.items():
                            # We could have had a deletion in this string
                            start_at_next_edit.add((next_trie, pos))
                            if letter == string[pos]:
                                todo.append((next_trie, pos+1)) # we matched farther
                            else:
                                # Could have been a substitution
                                start_at_next_edit.add((next_trie, pos+1))
            start_at_edit = start_at_next_edit
        return answers

# Sample usage
trie = Trie()
trie.add_string('foo')
trie.add_string('bar')
trie.add_string('baz')
print(trie.within_edits('ba', 1))
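One way to wire this into the original problem might look like the following sketch. It assumes the Trie class above and a normalize helper like the one sketched earlier, uses the len(name) / 9 edit budget as a candidate filter, and then confirms each candidate pair with fuzz.token_sort_ratio so the final scores match the brute-force approach (the product list here is just a placeholder):

from collections import defaultdict
from fuzzywuzzy import fuzz

names = ['Pepsi Light', 'Light Pepsi', 'Pepsi Max', 'Coca Cola']

# Group the original names by their normalized form (normalize() as sketched above).
by_norm = defaultdict(list)
for name in names:
    by_norm[normalize(name)].append(name)

# Put every distinct normalized name into the trie.
trie = Trie()
for norm in by_norm:
    trie.add_string(norm)

pairs = set()
for norm, originals in by_norm.items():
    budget = len(norm) // 9  # edit budget implied by a score cutoff of 90
    for candidate in set(trie.within_edits(norm, budget)):
        for a in originals:
            for b in by_norm[candidate]:
                if a < b:  # each unordered pair only once, no self-pairs
                    score = fuzz.token_sort_ratio(a, b)
                    if score >= 90:
                        pairs.add((a, b, score))

print(sorted(pairs))  # [('Light Pepsi', 'Pepsi Light', 100)]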

As others pointed out, FuzzyWuzzy uses the Levenshtein distance, which is O(N^2). However, there are quite a few things in your code that can be optimised to improve the runtime a lot. This will not be as fast as btilly's trie implementation, but you will keep a similar behaviour (e.g. sorting the words beforehand):

  1. Use RapidFuzz instead of FuzzyWuzzy (I am the author). It implements the same algorithms, but it is a lot faster.

  2. You are currently preprocessing the strings on each call to fuzz.token_sort_ratio; this can be done once beforehand.

  3. You can pass your score_cutoff to RapidFuzz, so it can exit early with a score of 0 when it knows the cutoff cannot be reached.

The following implementation takes around 47 seconds on my machine, while your current implementation runs about 7 minutes.

from rapidfuzz import fuzz, utils
import random
import string
from datetime import datetime
import pandas as pd

random.seed(18)

l = [''.join(random.choice(string.ascii_letters + string.digits + string.whitespace)
       for _ in range(random.randint(10, 20))
    )
    for s in range(10000)
]

start=datetime.now()
processed=[utils.default_process(name) for name in l]
i=0
res = []

for k in range(len(l)):
    for s in range(k+1,len(l)):
        probability = fuzz.token_sort_ratio(
            processed[k], processed[s], processor=False, score_cutoff=90)
        if  probability:
            res.append([l[k], l[s], probability])
            i +=1

df = pd.DataFrame(res, columns=['name1', 'name2', 'probab_same'])

print('Spent time: {}' .format(datetime.now() - start))           
print(df.head(5))
