
Is there any way to run this loop faster?

I am working on some string matching problems and use the fuzzywuzzy module to get similarity scores.

My target data is around 67K rows and the reference data is almost 4M rows. I created a loop, and one iteration takes around 19 minutes. Is there any way to make my loop run faster?

%%timeit
from fuzzywuzzy import process

df11['NEW'] = ""
for i in range(0, 4):
    # extractOne returns a (match, score) tuple for the single best choice;
    # assign it to row i only, rather than overwriting the whole column
    df11.at[i, 'NEW'] = process.extractOne(df11['Desc 1'][i], df['Description 2'])

df11.head()
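For reference, the per-row loop can also be written with pandas' `apply`. This won't by itself speed things up (the scorer still scans every reference string for every query); the sketch below uses difflib's `SequenceMatcher` as a stand-in for fuzzywuzzy's `process.extractOne`, and the tiny frames are made-up examples, not the real data:

```python
import pandas as pd
from difflib import SequenceMatcher

def extract_one(query, choices):
    """Stand-in for fuzzywuzzy's process.extractOne: returns (best match, 0-100 score)."""
    best = max(choices, key=lambda c: SequenceMatcher(None, query, c).ratio())
    return best, round(SequenceMatcher(None, query, best).ratio() * 100)

# Made-up example frames standing in for the real target/reference data
df11 = pd.DataFrame({'Desc 1': ['helo world', 'goodby world']})
df = pd.DataFrame({'Description 2': ['hello world', 'goodbye world', 'apple pie']})

# One (match, score) tuple per query row, instead of overwriting the whole column
df11['NEW'] = df11['Desc 1'].apply(lambda q: extract_one(q, df['Description 2']))
```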

Assuming:

  1. the target/choice strings are all relatively long (e.g. >20 characters) and not all very similar (e.g. differing by just one or two characters), and
  2. the edit distance between the query and the "best" target is relatively small (e.g. <10% of characters modified),

then I'd probably use trigrams to index the strings and then skip target lines that don't share enough trigrams with the query.
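A minimal sketch of that idea (the function and parameter names here are my own, not from any library):

```python
from collections import defaultdict

def trigrams(s):
    """Set of character trigrams in a lowercased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(choices):
    """Map each trigram to the indices of the choice strings containing it."""
    index = defaultdict(set)
    for idx, choice in enumerate(choices):
        for tri in trigrams(choice):
            index[tri].add(idx)
    return index

def candidates(query, index, min_shared=2):
    """Indices of choices sharing at least min_shared trigrams with the query."""
    counts = defaultdict(int)
    for tri in trigrams(query):
        for idx in index.get(tri, ()):
            counts[idx] += 1
    return [idx for idx, n in counts.items() if n >= min_shared]
```

Only the surviving candidates then need the expensive edit-distance scoring from fuzzywuzzy.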

I've been having a play with the "20 newsgroups dataset" and it takes my laptop:

  • 45 seconds to run fuzzywuzzy.extractOne using these lines as the choices/target
  • 0.3 seconds to find the nearest string using trigrams

This was after taking:

  1. 6 seconds to load 477,948 lines of text from 18,828 emails
  2. 15 seconds to turn the lines into a dictionary of 317,324 trigrams

My code is pretty hacky, but I could tidy it up. It would probably reduce total runtime to a day or so for all 67K of your query strings, maybe just a few hours if you did this in parallel with multiprocessing.
