I have a problem finding similar food names in my database (there are about 100k product names). I've decided to use fuzz.token_sort_ratio from the fuzzywuzzy library to find similar product names. This is how it works:
s1 = 'Pepsi Light'
s2 = 'Light Pepsi'
fuzz.token_sort_ratio(s1, s2)
100
Now I want to find all pairs of product names with similar words, i.e. where fuzz.token_sort_ratio >= 90. Here is my code:
#Find similar
start = datetime.now()
l = list(v_foods.name[0:20000])
i = 0
df = pd.DataFrame(columns=['name1', 'name2', 'probab_same'])
for k in range(len(l)):
    for s in range(k+1, len(l)):
        probability = fuzz.token_sort_ratio(l[k], l[s])
        if probability >= 90:
            df.loc[i] = [l[k], l[s], probability]
            i += 1
print('Spent time: {}'.format(datetime.now() - start))
df.head(5)
It takes a lot of time, and the more products I have, the longer it takes:

l = list(v_foods.name[0:5000])   Spent time: ~3 minutes
l = list(v_foods.name[0:10000])  Spent time: ~13 minutes
l = list(v_foods.name[0:20000])  Spent time: ~53 minutes

As I said above, my base has 100k names, so it will run very slowly. Are there any methods to optimize my algorithm?
Your problem is that you are comparing each name to every other name. That's n^2 comparisons, and so it gets slow. What you need to do is only compare pairs of names that have a chance of being similar enough.
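To illustrate that idea before getting to the method developed in the rest of this answer: a cheap way to restrict comparisons is to bucket names by the words they contain, so only names sharing at least one word become candidate pairs. This is only a rough sketch with made-up sample names (in the question these would come from v_foods.name), and it misses pairs that share no exact word:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data standing in for v_foods.name.
names = ['Pepsi Light', 'Light Pepsi', 'Coca Cola', 'Cola Coca Zero']

# Index names by each lowercased word they contain, so candidate pairs
# must share at least one word instead of comparing all n^2 pairs.
index = defaultdict(set)
for name in names:
    for token in name.lower().split():
        index[token].add(name)

candidate_pairs = set()
for bucket in index.values():
    for a, b in combinations(sorted(bucket), 2):
        candidate_pairs.add((a, b))

print(sorted(candidate_pairs))
```

Only the pairs inside each bucket are then scored with fuzz.token_sort_ratio, which is far fewer than all n^2 pairs when the buckets are small.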
To do better, we need to know what the library is actually doing. Thanks to this excellent answer we can tell. It calls fuzz._process_and_sort(name, True) on both names, then looks for a Levenshtein ratio. Which is to say that it computes a best way to get from one string to the other, and then calculates 100 * matched_chars / (matched_chars + edits). For this score to come out to 90+, the number of edits is at most len(name) / 9. (That condition is necessary but not sufficient: if those edits include substitutions or deletions in this string, they lower the number of matched characters and therefore the ratio.)
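To make that bound concrete, here is a small sketch of the arithmetic: solving 100 * matched / (matched + edits) >= cutoff for edits gives edits <= matched * (100 - cutoff) / cutoff, and matched can be at most the length of the name:

```python
# Necessary (not sufficient) edit budget implied by a score cutoff:
# 100 * m / (m + e) >= cutoff  =>  e <= m * (100 - cutoff) / cutoff,
# where the matched character count m is at most len(name).
def max_edits_for_cutoff(name, cutoff=90):
    return len(name) * (100 - cutoff) // cutoff

print(max_edits_for_cutoff('Pepsi Light'))      # 11 characters -> at most 1 edit
print(max_edits_for_cutoff('Pepsi Light', 95))  # stricter cutoff -> 0 edits
```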
So you can normalize all of the names quite easily. The question is: given a normalized name, can you find all the other normalized names within a maximum number of edits of it?
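The normalization itself takes only a few lines of stdlib code. This sketch mimics what fuzzywuzzy's _process_and_sort does (lowercase, replace non-alphanumeric characters with spaces, sort the tokens); the exact preprocessing steps are an assumption here, so check them against the library if they matter:

```python
import re

# Approximation of fuzzywuzzy's _process_and_sort: lowercase,
# replace non-alphanumeric runs with a space, then sort the tokens.
def normalize(name):
    tokens = re.sub(r'[^a-z0-9]+', ' ', name.lower()).split()
    return ' '.join(sorted(tokens))

print(normalize('Pepsi Light'))    # 'light pepsi'
print(normalize('Light  PEPSI!'))  # 'light pepsi' - same key as above
```

Two names that normalize to the same string automatically score 100, so they can be grouped immediately before any edit-distance search.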
The trick is to first put all of your normalized names into a Trie data structure. Then we can walk the Trie in parallel, exploring all branches that are within a certain edit distance. This allows big groups of normalized names that are out of that distance to be dropped without examining them individually.
Here is a Python implementation of the Trie that will let you find those pairs of normalized names.
import re

# Now we will build a trie. Every node has a list of words, and a dictionary
# mapping each next letter to the subtrie one step farther down.
class Trie:
    def __init__(self, path=''):
        self.strings = []
        self.dict = {}
        self.count_strings = 0
        self.path = path

    def add_string(self, string):
        trie = self
        for letter in string:
            trie.count_strings += 1
            if letter not in trie.dict:
                trie.dict[letter] = Trie(trie.path + letter)
            trie = trie.dict[letter]
        trie.count_strings += 1
        trie.strings.append(string)

    def __hash__(self):
        return id(self)

    def __repr__(self):
        answer = self.path + ":\n count_strings:" + str(self.count_strings) + "\n strings: " + str(self.strings) + "\n dict:"
        def indent(string):
            p = re.compile("^(?!:$)", re.M)
            return p.sub(" ", string)
        for letter in sorted(self.dict.keys()):
            subtrie = self.dict[letter]
            answer = answer + indent("\n" + subtrie.__repr__())
        return answer

    def within_edits(self, string, max_edits):
        # This will be all trie/string pos pairs that we have seen
        found = set()
        # This will be all trie/string pos pairs that we start the next edit with
        start_at_edit = set()

        # At distance 0, the base of the trie can match the start of the string.
        start_at_edit.add((self, 0))
        answers = []
        for edits in range(max_edits + 1): # 0..max_edits inclusive
            start_at_next_edit = set()
            todo = list(start_at_edit)
            for trie, pos in todo:
                if (trie, pos) not in found: # Have we processed this?
                    found.add((trie, pos))
                    if pos == len(string):
                        answers.extend(trie.strings) # ANSWERS FOUND HERE!!!
                        # We have to delete from the other string
                        for next_trie in trie.dict.values():
                            start_at_next_edit.add((next_trie, pos))
                    else:
                        # This string could have an insertion
                        start_at_next_edit.add((trie, pos+1))
                        for letter, next_trie in trie.dict.items():
                            # We could have had a deletion in this string
                            start_at_next_edit.add((next_trie, pos))
                            if letter == string[pos]:
                                todo.append((next_trie, pos+1)) # we matched farther
                            else:
                                # Could have been a substitution
                                start_at_next_edit.add((next_trie, pos+1))
            start_at_edit = start_at_next_edit
        return answers

# Sample usage
trie = Trie()
trie.add_string('foo')
trie.add_string('bar')
trie.add_string('baz')
print(trie.within_edits('ba', 1))
As others pointed out, FuzzyWuzzy uses the Levenshtein distance, which is O(N^2). However, in your code there are quite a few things that can be optimised to improve the runtime a lot. This will not be as fast as the trie implementation of btilly, but you will keep similar behaviour (e.g. sorting the words beforehand):

- Use RapidFuzz instead of FuzzyWuzzy (I am the author). It implements the same algorithms, but it is a lot faster.
- You are currently preprocessing strings on each call to fuzz.token_sort_ratio; this can be done once beforehand.
- You can pass your score_cutoff to RapidFuzz, so it can exit early with a score of 0 when it knows the cutoff cannot be reached.
The following implementation takes around 47 seconds on my machine, while your current implementation runs about 7 minutes.
from rapidfuzz import fuzz, utils
import random
import string
from datetime import datetime
import pandas as pd

random.seed(18)

# Random names of similar length to the ones in the question
l = [
    ''.join(random.choice(string.ascii_letters + string.digits + string.whitespace)
            for _ in range(random.randint(10, 20)))
    for s in range(10000)
]

start = datetime.now()
# Preprocess each name once, instead of on every comparison
processed = [utils.default_process(name) for name in l]
res = []
for k in range(len(l)):
    for s in range(k+1, len(l)):
        probability = fuzz.token_sort_ratio(
            processed[k], processed[s], processor=False, score_cutoff=90)
        if probability:
            res.append([l[k], l[s], probability])

df = pd.DataFrame(res, columns=['name1', 'name2', 'probab_same'])
print('Spent time: {}'.format(datetime.now() - start))
print(df.head(5))