
Is there any way to run this loop faster?

I am working on some string matching problems and use the fuzzywuzzy module to get similarity scores.

My target data is around 67K rows and the reference data is almost 4M rows. I created a loop, and one iteration takes around 19 minutes. Is there any way to make my loop run faster?

%%timeit
from fuzzywuzzy import process

df11['NEW'] = ""
for i in range(0, 4):
    # extractOne returns a (match, score) tuple for the single best choice;
    # assign it to row i only, rather than overwriting the whole column
    df11.at[i, 'NEW'] = process.extractOne(df11['Desc 1'][i], df['Description 2'])

df11.head()
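For reference, the per-row loop can also be written with pandas' `apply`. This won't by itself speed things up (the scorer still scans every reference string for every query); the sketch below uses difflib's `SequenceMatcher` as a stand-in for fuzzywuzzy's `process.extractOne`, and the tiny frames are made-up examples, not the real data:

```python
import pandas as pd
from difflib import SequenceMatcher

def extract_one(query, choices):
    """Stand-in for fuzzywuzzy's process.extractOne: returns (best match, 0-100 score)."""
    best = max(choices, key=lambda c: SequenceMatcher(None, query, c).ratio())
    return best, round(SequenceMatcher(None, query, best).ratio() * 100)

# Made-up example frames standing in for the real target/reference data
df11 = pd.DataFrame({'Desc 1': ['helo world', 'goodby world']})
df = pd.DataFrame({'Description 2': ['hello world', 'goodbye world', 'apple pie']})

# One (match, score) tuple per query row, instead of overwriting the whole column
df11['NEW'] = df11['Desc 1'].apply(lambda q: extract_one(q, df['Description 2']))
```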

Assuming:

  1. the target/choice strings are all relatively long (e.g. >20 characters) and not all very similar (e.g. differing by just one or two characters), and
  2. the edit distance between the query and the "best" target is relatively small (e.g. <10% of characters modified),

then I'd probably use trigrams to index the strings and then skip target lines that don't share enough trigrams with the query.
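A minimal sketch of that idea (the function and parameter names here are my own, not from any library):

```python
from collections import defaultdict

def trigrams(s):
    """Set of character trigrams in a lowercased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(choices):
    """Map each trigram to the indices of the choice strings containing it."""
    index = defaultdict(set)
    for idx, choice in enumerate(choices):
        for tri in trigrams(choice):
            index[tri].add(idx)
    return index

def candidates(query, index, min_shared=2):
    """Indices of choices sharing at least min_shared trigrams with the query."""
    counts = defaultdict(int)
    for tri in trigrams(query):
        for idx in index.get(tri, ()):
            counts[idx] += 1
    return [idx for idx, n in counts.items() if n >= min_shared]
```

Only the surviving candidates then need the expensive edit-distance scoring from fuzzywuzzy.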

I've been having a play with the "20 newsgroups dataset" and it takes my laptop:

  • 45 seconds to run fuzzywuzzy.extractOne using these lines as the choices/target
  • 0.3 seconds to find the nearest string using trigrams

This was after taking:

  1. 6 seconds to load 477,948 lines of text from 18,828 emails
  2. 15 seconds to turn the lines into a dictionary of 317,324 trigrams

My code is pretty hacky, but I could tidy it up. It would probably reduce total runtime to a day or so for all 67K of your query strings, maybe just a few hours if you did this in parallel with multiprocessing.
