

How to optimize an algorithm to find similar strings with fuzzywuzzy faster?

I have a problem finding similar names of food in my database (there are about 100k product names). I've decided to use fuzz.token_sort_ratio from the fuzzywuzzy library to find similar product names. This is how it works:

s1 = 'Pepsi Light'
s2 = 'Light Pepsi'
fuzz.token_sort_ratio(s1, s2)

100

Now I want to find all pairs of product names with similar words, where fuzz.token_sort_ratio >= 90. Here is my code:

from datetime import datetime

from fuzzywuzzy import fuzz
import pandas as pd

# Find similar names
start = datetime.now()
l = list(v_foods.name[0:20000])
i = 0
df = pd.DataFrame(columns=['name1', 'name2', 'probab_same'])
for k in range(len(l)):
    for s in range(k+1, len(l)):
        probability = fuzz.token_sort_ratio(l[k], l[s])
        if probability >= 90:
            df.loc[i] = [l[k], l[s], probability]
            i += 1
print('Spent time: {}'.format(datetime.now() - start))
df.head(5)

It takes a lot of time, and the more products I have, the longer it takes:

  1. l = list(v_foods.name[0:5000]) takes ~3 minutes
  2. l = list(v_foods.name[0:10000]) takes ~13 minutes
  3. l = list(v_foods.name[0:20000]) takes ~53 minutes

As I said above, my database has 100k names, so this will be very slow. Are there any methods to optimize my algorithm?

Your problem is that you are comparing each name to every other name. That's n^2 comparisons and so gets slow. What you need to do is only compare pairs of names that have a chance of being similar enough.

To do better, we need to know what the library is actually doing. Thanks to this excellent answer we can tell: it calls fuzz._process_and_sort(name, True) on both names, then looks for a Levenshtein ratio. That is, it computes a best way to get from one string to the other, and then calculates 100 * matched_chars / (matched_chars + edits). For this score to come out to 90+, the number of edits is at most len(name) / 9. (That condition is necessary but not sufficient; if those edits include substitutions and deletions in this string, that lowers the number of matched characters and lowers the ratio.)
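
To make that bound concrete, here is a quick sketch (not part of fuzzywuzzy itself) that re-derives the edit budget implied by a minimum score:

# Sketch: score = 100 * matched / (matched + edits) >= min_score
# implies edits <= matched * (100 - min_score) / min_score <= len(name) * (100 - min_score) / min_score
def max_edits_for_score(name, min_score=90):
    return len(name) * (100 - min_score) // min_score

print(max_edits_for_score('pepsi light'))  # 11 * 10 // 90 -> 1 edit allowed
print(max_edits_for_score('x' * 20))       # 20 * 10 // 90 -> 2 edits allowed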

So you can normalize all of the names quite easily. The question is: for a given normalized name, can you find all the other normalized names that are within a maximum number of edits of it?
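
A minimal sketch of such a normalization, assuming behaviour along the lines of fuzzywuzzy's default preprocessing plus token sorting (lowercase, treat anything that is not an ASCII letter or digit as a separator, sort the words):

import re

def normalize(name):
    # lowercase, split on anything that is not a letter or digit,
    # then sort the words so that word order no longer matters
    words = re.sub(r'[^0-9a-z]+', ' ', name.lower()).split()
    return ' '.join(sorted(words))

print(normalize('Pepsi Light'))    # -> 'light pepsi'
print(normalize('Light  Pepsi!'))  # -> 'light pepsi'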

The trick is to first put all of your normalized names into a Trie data structure. Then we can walk the Trie in parallel to explore all branches that are within a certain edit distance. This lets big groups of normalized names that are out of that distance be dropped without examining them individually.

Here is a Python implementation of the Trie that will let you find those pairs of normalized names.

import re

# Now we will build a trie.  Every node has a list of words that end there, and a
# dictionary mapping the next letter to the subtrie one level deeper.
class Trie:
    def __init__(self, path=''):
        self.strings = []
        self.dict = {}
        self.count_strings = 0
        self.path = path

    def add_string (self, string):
        trie = self

        for letter in string:
            trie.count_strings += 1
            if letter not in trie.dict:
                trie.dict[letter] = Trie(trie.path + letter)
            trie = trie.dict[letter]
        trie.count_strings += 1
        trie.strings.append(string)

    def __hash__ (self):
        return id(self)

    def __repr__ (self):
        answer = self.path + ":\n  count_strings:" + str(self.count_strings) + "\n  strings: " + str(self.strings) + "\n  dict:"
        def indent (string):
            p = re.compile("^(?!:$)", re.M)
            return p.sub("    ", string)
        for letter in sorted(self.dict.keys()):
            subtrie = self.dict[letter]
            answer = answer + indent("\n" + subtrie.__repr__())
        return answer

    def within_edits(self, string, max_edits):
        # This will be all trie/string pos pairs that we have seen
        found = set()
        # This will be all trie/string pos pairs that we start the next edit with
        start_at_edit = set()

        # At distance 0 we start at the base of the trie and at the start of the string.
        start_at_edit.add((self, 0))
        answers = []
        for edits in range(max_edits + 1): # 0..max_edits inclusive
            start_at_next_edit = set()
            todo = list(start_at_edit)
            for trie, pos in todo:
                if (trie, pos) not in found: # Have we processed this?
                    found.add((trie, pos))
                    if pos == len(string):
                        answers.extend(trie.strings) # ANSWERS FOUND HERE!!!
                        # We have to delete from the other string
                        for next_trie in trie.dict.values():
                            start_at_next_edit.add((next_trie, pos))
                    else:
                        # This string could have an insertion
                        start_at_next_edit.add((trie, pos+1))
                        for letter, next_trie in trie.dict.items():
                            # We could have had a deletion in this string
                            start_at_next_edit.add((next_trie, pos))
                            if letter == string[pos]:
                                todo.append((next_trie, pos+1)) # we matched farther
                            else:
                                # Could have been a substitution
                                start_at_next_edit.add((next_trie, pos+1))
            start_at_edit = start_at_next_edit
        return answers

# Sample usage
trie = Trie()
trie.add_string('foo')
trie.add_string('bar')
trie.add_string('baz')
print(trie.within_edits('ba', 1))
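
The sample above prints ['bar', 'baz'] (possibly in another order), since both are within one edit of 'ba'. As a rough sketch of how this could be wired into the original task, using the hypothetical normalize() helper from earlier and v_foods from the question (candidate pairs still need to be confirmed with the real scorer, because the edit bound is necessary but not sufficient):

# Sketch only: assumes v_foods from the question and the normalize() helper sketched above.
names = list(v_foods.name)
norm_to_names = {}
for name in names:
    norm_to_names.setdefault(normalize(name), []).append(name)

# Names that share the same normalized form already match each other with score 100.
trie = Trie()
for norm in norm_to_names:
    trie.add_string(norm)

candidate_pairs = set()
for norm in norm_to_names:
    max_edits = len(norm) // 9                # necessary, not sufficient, for a score >= 90
    for other in trie.within_edits(norm, max_edits):
        if other != norm:
            candidate_pairs.add(frozenset((norm, other)))

# candidate_pairs is now small enough to confirm each pair with fuzz.token_sort_ratio.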

As others pointed out, FuzzyWuzzy uses the Levenshtein distance, which is O(N^2). However, in your code there are quite a few things that can be optimised to improve the runtime a lot. This will not be as fast as btilly's trie implementation, but you will keep a similar behaviour (e.g. sorting the words beforehand):

  1. Use RapidFuzz instead of FuzzyWuzzy (I am the author). It implements the same algorithms, but it is a lot faster.

  2. You are currently preprocessing the strings on each call to fuzz.token_sort_ratio; this could be done once beforehand.

  3. You can pass your score_cutoff to RapidFuzz, so it can exit early with a score of 0 when it knows the score cannot be reached.

The following implementation takes around 47 seconds on my machine, while your current implementation runs for about 7 minutes.

from rapidfuzz import fuzz, utils
import random
import string
from datetime import datetime
import pandas as pd

random.seed(18)

l = [''.join(random.choice(string.ascii_letters + string.digits + string.whitespace)
       for _ in range(random.randint(10, 20))
    )
    for s in range(10000)
]

start=datetime.now()
processed=[utils.default_process(name) for name in l]
res = []

for k in range(len(l)):
    for s in range(k+1,len(l)):
        probability = fuzz.token_sort_ratio(
            processed[k], processed[s], processor=False, score_cutoff=90)
        if probability:
            res.append([l[k], l[s], probability])

df = pd.DataFrame(res, columns=['name1', 'name2', 'probab_same'])

print('Spent time: {}' .format(datetime.now() - start))           
print(df.head(5))
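
If you are on a newer RapidFuzz release (an assumption; the snippet above targets an older API), the double loop can also be pushed into the library with process.cdist, which scores the whole matrix in native code and can use several cores:

from rapidfuzz import process, fuzz, utils
import numpy as np

# Assumption: a RapidFuzz version that provides process.cdist; check your installed version.
processed = [utils.default_process(name) for name in l]
scores = process.cdist(processed, processed,
                       scorer=fuzz.token_sort_ratio,
                       score_cutoff=90, workers=-1)

# Entries below the cutoff are 0, so the upper triangle (excluding the diagonal)
# holds each qualifying pair exactly once.
rows, cols = np.nonzero(np.triu(scores, k=1))
res = [[l[r], l[c], scores[r, c]] for r, c in zip(rows, cols)]

For the full 100k names the score matrix itself becomes large, so you would probably want to process it in row chunks rather than all at once.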
