
Optimizing finding matching substrings between two lists with regex in Python

Here is my approach: for each element of a list of 'phrases', search it with every entry of a list of 'words' and collect the entries that occur in that phrase, returning one list of matches per phrase.

import re

def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)
print(to_be_appended)

# (desired and actual) output
[['my'],
 ['name', 'is'],
 ['name', 'is'],
 ['you'],
 ['name', 'is', 'your'],
 ['my', 'name', 'is']]

Since the 'words' list (list_to_search) has ~1,700 entries and the 'phrases' list (list_to_be_searched) has ~26,561, the code takes over 30 minutes to finish. I don't think my code above is written in a Pythonic way or uses an efficient data structure. :(

Could anyone give some advice to optimize or make it faster?

Thanks!
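As a point of comparison, here is a hypothetical micro-optimization of the loop above: compile each word's pattern once up front (escaping it, in case a word contains regex metacharacters) instead of rebuilding it for every (word, phrase) pair. This is only a sketch of the same whole-word search, not a rewrite of the algorithm:

```python
import re

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

# Compile one pattern per word, once, rather than on every call.
compiled = [(w, re.compile(r'\b{}\b'.format(re.escape(w)), re.IGNORECASE))
            for w in list_to_search]

result = [[w for w, pat in compiled if pat.search(phrase)]
          for phrase in list_to_be_searched]
print(result)
```

This avoids the repeated `re.search` pattern construction per pair, though Python's internal regex cache already hides part of that cost.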

Actually, I wrote the wrong example above. What if 'list_to_search' has elements with more than two words?

import re

def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)
print(to_be_appended)
# (desired and actual) output
[['hello my'],
 ['name', 'is'],
 ['name', 'is'],
 [],
 ['name', 'is', 'is your name', 'your'],
 ['name', 'is']]

Timing 1st Method:

%%timeit
def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)
#43.2 µs ± 346 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

2nd Method (Nested list comprehension and re.findall)

%%timeit
[[j for j in list_to_search if j in re.findall(r"\b{}\b".format(j), i)] for i in list_to_be_searched]
#40.3 µs ± 454 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Timing has definitely improved, but would there be a faster way? Or is the task inherently slow, considering what it does?

You can use a nested list comprehension:

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

[[j for j in list_to_search if j in i.split()] for i in list_to_be_searched]

[['my'],
 ['name', 'is'],
 ['name', 'is'],
 ['you'],
 ['name', 'is', 'your'],
 ['my', 'name', 'is']]

While the most straightforward / clear approach is to use list comprehensions, I wanted to see if regex could do it better.

Using regex on every item in list_to_be_searched didn't seem to have any performance gains. But by joining list_to_be_searched into one big block of text and matching it against a single regex pattern constructed from list_to_search, there was a slight increase in performance:

In [1]: import re
   ...:
   ...: list_to_search = ['my', 'name', 'is', 'you', 'your']
   ...: list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']
   ...:
   ...: def simple_method(to_search, to_be_searched):
   ...:   return [[j for j in to_search if j in i.split()] for i in to_be_searched]
   ...:
   ...: def regex_method(to_search, to_be_searched):
   ...:   word = re.compile(r'(\b(?:' + r'|'.join(to_search) + r')\b(?:\n)?)')
   ...:   blob = '\n'.join(to_be_searched)
   ...:   phrases = word.findall(blob)
   ...:   return [phrase.split(' ') for phrase in ' '.join(phrases).split('\n ')]
   ...:
   ...: def alternate_regex_method(to_search, to_be_searched):
   ...:   word = re.compile(r'(\b(?:' + r'|'.join(to_search) + r')\b(?:\n)?)')
   ...:   phrases = []
   ...:   for item in to_be_searched:
   ...:     phrases.append(word.findall(item))
   ...:   return phrases
   ...:
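To make the join-then-split trick in `regex_method` concrete, here is a standalone run on the question's small inputs (same pattern construction as above). Note one subtlety: matches come back in order of occurrence in the text, e.g. `['is', 'name']` for the third phrase, rather than in list_to_search order:

```python
import re

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

# One alternation pattern; a match at end-of-line also consumes the newline,
# which is what lets us recover the per-phrase grouping afterwards.
word = re.compile(r'(\b(?:' + '|'.join(list_to_search) + r')\b(?:\n)?)')
blob = '\n'.join(list_to_be_searched)
phrases = word.findall(blob)
result = [phrase.split(' ') for phrase in ' '.join(phrases).split('\n ')]
print(result)
```

The kept newlines act as phrase separators, so splitting the rejoined matches on `'\n '` reconstructs one sub-list per input phrase.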

In [2]: %timeit -n 100 simple_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 23.1 µs per loop

In [3]: %timeit -n 100 regex_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 18.6 µs per loop

In [4]: %timeit -n 100 alternate_regex_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 23.4 µs per loop

To see how this performed under large inputs, I used the 1000 most frequent words in English, taken one word at a time, as list_to_search , and the entire text of David Copperfield from Project Gutenberg, taken one line at a time, as list_to_be_searched :

In [5]: book = open('/tmp/copperfield.txt', 'r+')

In [6]: list_to_be_searched = [line for line in book]

In [7]: len(list_to_be_searched)
Out[7]: 38589

In [8]: words = open('/tmp/words.txt', 'r+')

In [9]: list_to_search = [word for word in words]

In [10]: len(list_to_search)
Out[10]: 1000

Here are the results:

In [15]: %timeit -n 10 simple_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 31.9 s per loop

In [16]: %timeit -n 10 regex_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 4.28 s per loop

In [17]: %timeit -n 10 alternate_regex_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 4.43 s per loop

So, go with either of the regex methods if you are keen on performance. Hope that helped! :)
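One caveat worth adding to the pattern-joining approach (my note, not the answerer's): if the search terms can ever contain regex metacharacters (`.`, `+`, `?`, parentheses, ...), escape each term before joining them into the alternation, or the compiled pattern may not mean what you expect:

```python
import re

list_to_search = ['my', 'name', 'is', 'you', 'your']

# re.escape makes each term match literally inside the alternation.
word = re.compile(r'\b(?:' + '|'.join(map(re.escape, list_to_search)) + r')\b',
                  re.IGNORECASE)
matches = word.findall('What is your name?')
print(matches)
```

Note also that `\byou\b` correctly refuses to match inside "your": the word boundary after "you" fails, and the alternation falls through to the longer term.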
