I am working on some string matching problems and use the fuzzywuzzy module to get a similarity score.
My target data is around 67K rows and the reference data is almost 4M rows. I wrote a loop, and one iteration takes around 19 minutes. Is there any way to make my loop run faster?
%%timeit
from fuzzywuzzy import process

df11['NEW'] = ""
for i in range(0, 4):  # testing on the first 4 rows only
    # extractOne returns a (best_match, score) tuple; store the match per row
    # instead of overwriting the whole 'NEW' column on every iteration
    df11.loc[i, 'NEW'] = process.extractOne(df11['Desc 1'][i], df['Description 2'])[0]
df11.head()
Assuming the choice strings are all relatively long (e.g. >20 characters) and they're not all very similar (e.g. differing by just one or two characters), then I'd probably use trigrams to index the strings and then ignore target lines that don't share enough trigrams with the query.
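A minimal sketch of that trigram prefilter, using only the standard library (the helper names `trigrams`, `build_index`, and `candidates` are my own, not from the original answer): index every choice string by its character trigrams, then only run the expensive fuzzy scorer against choices that share at least a few trigrams with the query.

```python
from collections import defaultdict

def trigrams(s):
    """Set of 3-character substrings of s (lowercased)."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(choices):
    """Map each trigram to the positions of the choices containing it."""
    index = defaultdict(set)
    for pos, choice in enumerate(choices):
        for tri in trigrams(choice):
            index[tri].add(pos)
    return index

def candidates(query, index, min_shared=3):
    """Positions of choices sharing at least min_shared trigrams with query."""
    counts = defaultdict(int)
    for tri in trigrams(query):
        for pos in index.get(tri, ()):
            counts[pos] += 1
    return [pos for pos, c in counts.items() if c >= min_shared]

choices = ["red apple pie", "green apple tart", "banana bread loaf"]
index = build_index(choices)
# "banana bread loaf" shares no trigrams with the query, so it is skipped
cand = [choices[i] for i in candidates("red apple tart", index)]
```

You would then call `process.extractOne(query, cand)` on the surviving candidates only, which can cut the 4M comparisons per query down to a small fraction.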
I've been having a play with the "20 newsgroups" dataset on my laptop, running fuzzywuzzy's extractOne with these lines as the choices/targets, and timing it.
My code is pretty hacky, but I could tidy it up. It would probably reduce the total runtime to a day or so for all 67K of your query strings, and maybe just a few hours if you ran it in parallel with multiprocessing.
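A hedged sketch of the multiprocessing part: fan the queries out over worker processes with `multiprocessing.Pool`, one best-match lookup per query. To keep the example self-contained, the standard-library `difflib.get_close_matches` stands in for the real scorer; in practice you would swap in `process.extractOne` (ideally combined with the trigram prefilter). The names `CHOICES` and `best_match` are illustrative, not from the original post.

```python
import difflib
from multiprocessing import Pool

# Stand-in reference data; in the real problem this is the 4M-row column.
CHOICES = ["red apple pie", "green apple tart", "banana bread loaf"]

def best_match(query):
    """Return (query, best matching choice) using a stdlib fuzzy matcher."""
    matches = difflib.get_close_matches(query, CHOICES, n=1, cutoff=0.0)
    return (query, matches[0] if matches else None)

if __name__ == "__main__":
    queries = ["red apple tart", "banana loaf"]
    # Each worker process handles a slice of the queries independently.
    with Pool(processes=2) as pool:
        results = pool.map(best_match, queries)
    print(results)
```

Because each query is independent, this parallelises cleanly; the main cost to watch is that each worker needs access to the reference data (here a module-level constant, which child processes inherit).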