Fastest way to compare large strings in python

Question

I have a dictionary of words with their frequencies as follows.

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}

I have a set of strings as follows.

recipes_book = "For today's lesson we will show you how to make biscuit pudding using 
yummy tim tam and fresh milk."

In the above string I have "biscuit pudding", "yummy tim tam" and "fresh milk" from the dictionary.

I am currently tokenizing the string to identify the words in the dictionary as follows.

words = recipes_book.split()
for word in words:
    if word in mydictionary:
        print("Match Found!")

However it only works for one word dictionary keys. Hence, I am interested in the fastest way (because my real recipes are very large texts) to identify the dictionary keys with more than one word. Please help me.

Answer 1

Build up your regex and compile it.

import re

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}

searcher = re.compile("|".join(mydictionary.keys()), flags=re.I | re.S)

for match in searcher.findall(recipes_book):
    mydictionary[match] += 1

Output after this

{'yummy tim tam': 4, 'biscuit pudding': 4, 'chocolates': 5, 'fresh milk': 3}

Answer 2

According to some tests, the "in" keywork is faster than "re" module :

What's a faster operation, re.match/search or str.find?

There is no problem with spaces here. Supposing mydictionary is static (predefined), I think you should probably go for the inverse thing:

for key in mydictionary.iterkeys():
    if key in recipes_book:
        print("Match Found!")
        mydictionary[key] += 1

In python2, using iterkeys you have an iterator and it's a good practice. With python3 you could cycle directly on the dict.

Answer 3

Try the other way around by search the text you want to find in the large chunk of str data.

import re
for item in mydictionary:
    match = re.search(item, recipes_book, flags=re.I | re.S)
    if match:
       start, end = match.span()
       print("Match found for %s between %d and %d character span" % (match.group(0), start, end))

Fastest way to compare large strings in python

Question

3 answers

solution1
2 ACCPTED 2017-10-03 07:19:03

solution2
1 2017-10-03 07:03:49

solution3
0 2017-10-03 07:01:52

Fastest way to compare large strings in python

Question

3 answers

solution1 2 ACCPTED 2017-10-03 07:19:03

solution2 1 2017-10-03 07:03:49

solution3 0 2017-10-03 07:01:52

solution1
2 ACCPTED 2017-10-03 07:19:03

solution2
1 2017-10-03 07:03:49

solution3
0 2017-10-03 07:01:52