
Fastest way to compare large strings in python

I have a dictionary of words with their frequencies as follows.

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}

I have a set of strings as follows.

recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam and fresh milk."

In the above string I have "biscuit pudding", "yummy tim tam" and "fresh milk" from the dictionary.

I am currently tokenizing the string to identify the words in the dictionary as follows.

words = recipes_book.split()
for word in words:
    if word in mydictionary:
        print("Match Found!")

However, this only works for single-word dictionary keys. I am therefore interested in the fastest way (my real recipes are very large texts) to identify dictionary keys that contain more than one word. Please help me.

Build up your regex and compile it.

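import re

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}

# Escape each key so it is matched literally, and try longer keys first.
keys = sorted(mydictionary, key=len, reverse=True)
searcher = re.compile("|".join(re.escape(k) for k in keys), flags=re.I | re.S)

# recipes_book is the string from the question.
for match in searcher.findall(recipes_book):
    # With re.I the matched text may differ in case from the (lowercase) key.
    mydictionary[match.lower()] += 1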

Output after this

{'yummy tim tam': 4, 'biscuit pudding': 4, 'chocolates': 5, 'fresh milk': 3}
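If partial-word hits are a concern (for example "fresh milk" inside "refresh milky"), the same idea can be extended with word boundaries. The following is only a sketch, not part of the original answer; the helper name count_phrases is made up for illustration, and it assumes every key starts and ends with a word character so that \b behaves as expected.

```python
import re
from collections import Counter

def count_phrases(text, phrases):
    """Count whole-word, case-insensitive occurrences of each phrase in text."""
    phrases = list(phrases)
    if not phrases:
        return Counter()
    # Longest phrases first, so a longer key wins over a shorter key it starts with.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    # Map the lowercased match back to the key as it was originally spelled.
    lookup = {p.lower(): p for p in phrases}
    return Counter(lookup[m.group(0).lower()] for m in pattern.finditer(text))

mydictionary = {'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3}
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

found = count_phrases(recipes_book, mydictionary)
for key, hits in found.items():
    mydictionary[key] += hits
print(mydictionary)
```

The text is still scanned once, whatever the number of keys, which is the main reason to prefer a single compiled pattern when the dictionary is large.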

According to some tests, the "in" operator is faster than the "re" module:

What's a faster operation, re.match/search or str.find?

Spaces are not a problem here. Assuming mydictionary is static (predefined), you should probably do it the other way around and look for each key in the text:

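for key in mydictionary:
    if key in recipes_book:
        print("Match Found!")
        mydictionary[key] += 1

Iterating over the dict directly works in both Python 2 and Python 3. In Python 2 you could also use iterkeys() to get an iterator, but that method no longer exists in Python 3. Note that this adds 1 per key found, no matter how many times the key occurs in the text.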
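Which approach is faster will depend on the number of keys and the size of the text, so it is worth measuring on your own data. Below is a rough timing sketch using the standard timeit module; the function names find_with_in and find_with_regex and the repeated sample text are made up for illustration, and the numbers it prints will vary from machine to machine.

```python
import re
import timeit

mydictionary = {'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3}
# Repeat the sample sentence to imitate a large recipe text.
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk. ") * 1000

def find_with_in(text, keys):
    """One substring test per key; each test scans the text until the first hit."""
    return {k for k in keys if k in text}

def find_with_regex(text, pattern):
    """One pass over the text with a precompiled alternation of all keys."""
    return set(pattern.findall(text))

ordered_keys = sorted(mydictionary, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in ordered_keys))

t_in = timeit.timeit(lambda: find_with_in(recipes_book, mydictionary), number=20)
t_re = timeit.timeit(lambda: find_with_regex(recipes_book, pattern), number=20)
print("in operator: %.4f s, compiled regex: %.4f s" % (t_in, t_re))
```

Keep in mind that the two functions do slightly different work: the "in" test stops at the first occurrence of each key, while findall collects every occurrence.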

Try it the other way around: search the large chunk of string data for each piece of text you want to find.

import re

for item in mydictionary:
    # Escape the key so that it is matched literally, not as a regex.
    match = re.search(re.escape(item), recipes_book, flags=re.I | re.S)
    if match:
        start, end = match.span()
        print("Match found for %s between %d and %d character span" % (match.group(0), start, end))
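re.search reports only the first occurrence of each key. If every occurrence is needed, re.finditer can be used instead. A minimal sketch follows; the helper name find_spans and the sample text are hypothetical and not from the original answer.

```python
import re

def find_spans(text, phrases):
    """Return {phrase: [(start, end), ...]} for every case-insensitive occurrence."""
    spans = {}
    for phrase in phrases:
        pattern = re.compile(re.escape(phrase), flags=re.IGNORECASE)
        hits = [m.span() for m in pattern.finditer(text)]
        if hits:  # leave out phrases that never occur
            spans[phrase] = hits
    return spans

sample_text = "fresh milk, then more Fresh Milk"
sample_spans = find_spans(sample_text, ["fresh milk", "chocolates"])
print(sample_spans)  # {'fresh milk': [(0, 10), (22, 32)]}
```

This still makes one pass over the text per key, so for a very large dictionary a single combined pattern is likely to scale better.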
