Comparison between two lists of strings for finding substring in python
Optimizing finding matching substring between the two lists by regex in Python
This is my method for finding substrings in a list of "phrases" by searching through a list of "words": it returns the matching substrings found in each element of the phrase list.
import re

def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)

print(to_be_appended)
# (desired and actual) output
[['my'],
['name', 'is'],
['name', 'is'],
['you'],
['name', 'is', 'your'],
['my', 'name', 'is']]
Since the list of "words" (list_to_search) has ~1,700 elements and the list of "phrases" (list_to_be_searched) has ~26,561, the code takes more than 30 minutes to finish. I don't think my code above is written with Pythonic style or efficient data structures in mind. :(
Can anyone suggest ways to optimize or speed it up?
Thanks!
Actually, the example I wrote above was too narrow. What if "list_to_search" has elements of two or more words?
import re

def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']

to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)

print(to_be_appended)
# (desired and actual) output
[['hello my'],
['name', 'is'],
['name', 'is'],
[],
['name', 'is', 'is your name', 'your'],
['name', 'is']]
Timing the first method:
%%timeit
def is_phrase_in(phrase, text):
    return re.search(r"\b{}\b".format(phrase), text, re.IGNORECASE) is not None

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']
to_be_appended = []
for phrase in list_to_be_searched:
    searched = []
    for word in list_to_search:
        if is_phrase_in(word, phrase):
            searched.append(word)
    to_be_appended.append(searched)
#43.2 µs ± 346 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Second method (nested list comprehension and re.findall):
%%timeit
[[j for j in list_to_search if j in re.findall(r"\b{}\b".format(j), i)] for i in list_to_be_searched]
#40.3 µs ± 454 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The timing definitely improved, but is there an even faster way? Or is this task inherently slow, given what it has to do?
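One cheap improvement worth trying before restructuring anything (my sketch, not from the original post): compile each search term's pattern once up front instead of re-formatting and re-parsing it for every phrase, and pass each term through re.escape in case it ever contains a regex metacharacter.

```python
import re

list_to_search = ['hello my', 'name', 'is', 'is your name', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
                       'how are you', 'what is your name', 'my name is jane doe']

# Compile one pattern per search term up front; re.escape keeps the
# pattern literal even if a term contains characters like '.' or '+'.
compiled = [(w, re.compile(r"\b{}\b".format(re.escape(w)), re.IGNORECASE))
            for w in list_to_search]

result = [[w for w, pat in compiled if pat.search(phrase)]
          for phrase in list_to_be_searched]
print(result)
```

This keeps the output identical to the original nested loops while avoiding the repeated pattern compilation hidden inside re.search.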
You can use a nested list comprehension:
list_to_search = ['my', 'name', 'is', 'you', 'your']
list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name',
'how are you', 'what is your name', 'my name is jane doe']
[[j for j in list_to_search if j in i.split()] for i in list_to_be_searched]
[['my'],
['name', 'is'],
['name', 'is'],
['you'],
['name', 'is', 'your'],
['my', 'name', 'is']]
While the most direct/clearest approach is the list comprehension, I wanted to see whether a regex could do better. Applying a regex to each item of list_to_be_searched did not seem to give any performance gain. But joining list_to_be_searched into one big blob of text and matching it against a regex pattern constructed from list_to_search gives a slight improvement:
In [1]: import re
   ...:
   ...: list_to_search = ['my', 'name', 'is', 'you', 'your']
   ...: list_to_be_searched = ['hello my', 'name is', 'john doe doe is last name', 'how are you', 'what is your name', 'my name is jane doe']
   ...:
   ...: def simple_method(to_search, to_be_searched):
   ...:     return [[j for j in to_search if j in i.split()] for i in to_be_searched]
   ...:
   ...: def regex_method(to_search, to_be_searched):
   ...:     word = re.compile(r'(\b(?:' + r'|'.join(to_search) + r')\b(?:\n)?)')
   ...:     blob = '\n'.join(to_be_searched)
   ...:     phrases = word.findall(blob)
   ...:     return [phrase.split(' ') for phrase in ' '.join(phrases).split('\n ')]
   ...:
   ...: def alternate_regex_method(to_search, to_be_searched):
   ...:     word = re.compile(r'(\b(?:' + r'|'.join(to_search) + r')\b(?:\n)?)')
   ...:     phrases = []
   ...:     for item in to_be_searched:
   ...:         phrases.append(word.findall(item))
   ...:     return phrases
   ...:
In [2]: %timeit -n 100 simple_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 23.1 µs per loop
In [3]: %timeit -n 100 regex_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 18.6 µs per loop
In [4]: %timeit -n 100 alternate_regex_method(list_to_search, list_to_be_searched)
100 loops, best of 3: 23.4 µs per loop
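One caveat about building the alternation this way (my note, not part of the original answer): if any search term contains a regex metacharacter, the joined pattern can break or match the wrong thing, so it is safer to pass each term through re.escape when constructing it:

```python
import re

list_to_search = ['my', 'name', 'is', 'you', 'your']
# re.escape keeps every alternative literal even if a term
# contains metacharacters such as '.', '+' or '?'.
word = re.compile(r'\b(?:' + '|'.join(map(re.escape, list_to_search)) + r')\b')
print(word.findall('what is your name'))
```

For plain English words this is a no-op, so it should not change the timings above.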
To see how these methods fare on much larger input, I used the 1,000 most common words in English [1] as list_to_search, one word per item, and the entire text of David Copperfield from Project Gutenberg [2] as list_to_be_searched, one line per item:
In [5]: book = open('/tmp/copperfield.txt', 'r+')
In [6]: list_to_be_searched = [line for line in book]
In [7]: len(list_to_be_searched)
Out[7]: 38589
In [8]: words = open('/tmp/words.txt', 'r+')
In [9]: list_to_search = [word for word in words]
In [10]: len(list_to_search)
Out[10]: 1000
Here are the results:
In [15]: %timeit -n 10 simple_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 31.9 s per loop
In [16]: %timeit -n 10 regex_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 4.28 s per loop
In [17]: %timeit -n 10 alternate_regex_method(list_to_search, list_to_be_searched)
10 loops, best of 3: 4.43 s per loop
So if you are keen on performance, use either of the two regex methods. Hope that helps! :)