Python - find occurrences of list of strings within string
I have a large string and a list of search strings, and I want to build a boolean list indicating whether each of the search strings exists in the large string. What is the fastest way to do this in Python?
Below is a toy example using a naive approach, but I think there is likely a more efficient way of doing this.
For example, the code below should return [1, 1, 0], since both "hello" and "world" exist in the test string.
def check_strings(search_list, input):
    output = []
    for s in search_list:
        if input.find(s) > -1:
            output.append(1)
        else:
            output.append(0)
    return output

search_strings = ["hello", "world", "goodbye"]
test_string = "hello world"
print(check_strings(search_strings, test_string))
I can't say whether this is the fastest (it is still O(n*m)), but this is the way I would do it:
def check_strings(search_list, input_string):
    return [s in input_string for s in search_list]
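The list comprehension above returns True/False values, whereas the question asked for 1/0. If integers are needed, a minimal variant could wrap the membership test in int(). The function name check_strings_as_ints is made up for this illustration.

```python
def check_strings_as_ints(search_list, input_string):
    # int(True) == 1 and int(False) == 0, so this maps each
    # membership test onto the 1/0 form used in the question.
    return [int(s in input_string) for s in search_list]

result = check_strings_as_ints(["hello", "world", "goodbye"], "hello world")
print(result)  # [1, 1, 0]
```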
The following program might be faster, or it might not. It uses a regular expression to make one pass through the input string. Note that you may want to use re.escape(i) in the re.findall() expression, or not, depending upon your needs.
def check_strings_re(search_string, input_string):
    import re
    return [any(l)
            for l in
            zip(*re.findall('|'.join('('+i+')' for i in search_string),
                            input_string))]
Here is a complete test program:
def check_strings(search_list, input_string):
    return [s in input_string for s in search_list]

def check_strings_re(search_string, input_string):
    import re
    return [any(l)
            for l in
            zip(*re.findall('|'.join('('+i+')' for i in search_string),
                            input_string))]

search_strings = ["hello", "world", "goodbye"]
test_string = "hello world"
assert check_strings(search_strings, test_string) == [True, True, False]
assert check_strings_re(search_strings, test_string) == [True, True, False]
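Two edge cases of check_strings_re are worth knowing about. A search string containing regex metacharacters (such as "c++") is interpreted as a pattern unless it is escaped, and when nothing matches at all, zip(*[]) yields nothing, so the function returns an empty list rather than a list of False values. The following is a sketch of one possible variant, not the answer author's code: it escapes every term and uses finditer with Match.lastindex to record which group matched. The name check_strings_re_escaped is hypothetical.

```python
import re

def check_strings_re_escaped(search_strings, input_string):
    # Hypothetical variant: one capturing group per escaped search string.
    if not search_strings:
        return []
    pattern = re.compile(
        '|'.join('(' + re.escape(s) + ')' for s in search_strings))
    found = [False] * len(search_strings)
    for match in pattern.finditer(input_string):
        # Groups are numbered from 1; lastindex is the group that matched.
        found[match.lastindex - 1] = True
    return found

print(check_strings_re_escaped(["hello", "c++", "goodbye"], "hello c++ world"))
print(check_strings_re_escaped(["x", "y"], "hello"))
```

Because the result list is pre-filled with False, the no-match case returns one entry per search string.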
An implementation using the Aho-Corasick algorithm ( https://pypi.python.org/pypi/pyahocorasick/ ), which makes a single pass through the string:
import ahocorasick
import numpy as np

def check_strings(search_list, input):
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()
    index_list = []
    for item in A.iter(input):
        index_list.append(item[1][0])
    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()

search_strings = ["hello", "world", "goodbye"]
test_string = "hello world"
print(check_strings(search_strings, test_string))
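To illustrate why a single pass is enough, here is a small pure-Python sketch of the Aho-Corasick idea: a trie of the search strings, failure links computed breadth-first, and one scan over the text. This is an illustration only, not how pyahocorasick is implemented, and it will be far slower than the C extension. The name check_strings_ac and the list-based node layout are assumptions made for this example.

```python
from collections import deque

def check_strings_ac(search_list, text):
    # Trie stored in parallel lists; node 0 is the root.
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # failure link of each node
    out = [[]]    # indices of search strings recognised at each node

    # 1. Build the trie from the search strings.
    for idx, word in enumerate(search_list):
        node = 0
        for ch in word:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(idx)

    # 2. Breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the matches reachable through the failure link.
            out[child].extend(out[fail[child]])
            queue.append(child)

    # 3. Scan the text once, following goto and failure links.
    found = [False] * len(search_list)
    for idx in out[0]:
        found[idx] = True  # an empty search string occurs in any text
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for idx in out[node]:
            found[idx] = True
    return found

print(check_strings_ac(["hello", "world", "goodbye"], "hello world"))
```

Unlike a regex alternation, the failure links also report search strings that overlap or sit inside a longer match, for example "world" inside "hello world".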
I'm posting this just for comparison. My benchmark code:
#!/usr/bin/env python3

def gettext():
    from os import scandir
    l = []
    for file in scandir('.'):
        if file.name.endswith('.txt'):
            l.append(open(file.name).read())
    return ' '.join(l)

def getsearchterms():
    return list(set(open('searchterms').read().split(';')))

def rob(search_string, input_string):
    import re
    return [any(l)
            for l in
            zip(*re.findall('|'.join('('+i+')' for i in search_string),
                            input_string))]

def blotosmetek(search_strings, input_string):
    import re
    regexp = re.compile('|'.join([re.escape(x) for x in search_strings]))
    found = set(regexp.findall(input_string))
    return [x in found for x in search_strings]

def ahocorasick(search_list, input):
    import ahocorasick
    import numpy as np
    A = ahocorasick.Automaton()
    for idx, s in enumerate(search_list):
        A.add_word(s, (idx, s))
    A.make_automaton()
    index_list = []
    for item in A.iter(input):
        index_list.append(item[1][0])
    output_list = np.array([0] * len(search_list))
    output_list[index_list] = 1
    return output_list.tolist()

def naive(search_list, text):
    return [s in text for s in search_list]

def test(fn, args):
    start = datetime.now()
    ret = fn(*args)
    end = datetime.now()
    return (end-start).total_seconds()

if __name__ == '__main__':
    from datetime import datetime
    text = gettext()
    print("Got text, total of", len(text), "characters")
    search_strings = getsearchterms()
    print("Got search terms, total of", len(search_strings), "words")
    fns = [ahocorasick, blotosmetek, naive, rob]
    for fn in fns:
        r = test(fn, [search_strings, text])
        print(fn.__name__, r*1000, "ms")
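The test() helper above measures a single run with datetime.now(). As an alternative sketch (not part of the original benchmark), time.perf_counter is a clock intended for measuring short intervals, and taking the best of a few repeats reduces noise. The helper name time_call and the repeat parameter are assumptions for this illustration.

```python
from time import perf_counter

def time_call(fn, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = perf_counter()
        fn(*args)
        best = min(best, perf_counter() - start)
    return best

def naive(search_list, text):
    return [s in text for s in search_list]

# Small synthetic input, only to show how the helper is called.
elapsed = time_call(naive, ["hello", "world", "goodbye"], "hello world " * 1000)
print("naive", elapsed * 1000, "ms")
```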
I used the distinct words that appear in Leviathan as search terms and concatenated the 25 most downloaded books from Project Gutenberg as the text to search. Results are as follows:
Got text, total of 18252025 characters
Got search terms, total of 12824 words
ahocorasick 3824.111 milliseconds
Błotosmętek 360565.542 milliseconds
naive 73765.67 ms
Rob's version has been running for about an hour and still hasn't finished. Maybe it's broken, maybe it's simply painfully slow.
My version using regular expressions:
import re

def check_strings(search_strings, input_string):
    # Escape each term so it is matched literally, then scan the input once.
    regexp = re.compile('|'.join([re.escape(x) for x in search_strings]))
    found = set(regexp.findall(input_string))
    return [x in found for x in search_strings]
On the test data provided by the original poster it is an order of magnitude slower than Rob's pretty solution, but I'm going to do some benchmarking on a bigger sample.
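One caveat with the alternation-based regex approaches: re.findall returns non-overlapping matches, and at each position the first alternative that matches wins. A search string that only occurs inside a match of another search string is therefore never reported. The sketch below compares the regex version with the plain `in` test on such an input. The function names are chosen for this illustration.

```python
import re

def check_strings_regex(search_strings, input_string):
    # Same approach as the regex answer above: one alternation, one pass.
    regexp = re.compile('|'.join(re.escape(x) for x in search_strings))
    found = set(regexp.findall(input_string))
    return [x in found for x in search_strings]

def check_strings_naive(search_strings, input_string):
    return [s in input_string for s in search_strings]

terms = ["hello world", "world"]
text = "hello world"
regex_result = check_strings_regex(terms, text)
naive_result = check_strings_naive(terms, text)
print(regex_result)  # [True, False]: "world" is consumed by the longer match
print(naive_result)  # [True, True]
```

If the search terms can overlap or contain one another, the two approaches do not give the same answers, so timings alone do not tell the whole story.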