Search haystack for several equal length needles (Python)
I am looking for a way to search a large string for a large number of equal-length substrings.
My current method is basically this:
offset = 0
found = []
# offset counts 8-character chunks, so the bound must be scaled by 8
while offset * 8 < len(haystack):
    current_chunk = haystack[offset*8:offset*8+8]
    if current_chunk in needles:
        found.append(current_chunk)
    offset += 1
This is painfully slow. Is there a better Python way of doing this?
More Pythonic, much faster:
found = []
for needle in needles:
    if needle in haystack:
        found.append(needle)
Edit: Some limited testing gives these results:
This algorithm: 0.000135183334351
Your algorithm: 0.984048128128
Much faster.
I think you can break the work up across multiple cores and parallelize your search. Something along the lines of:
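A further speedup for the chunk-scanning approach, independent of the answer above: if `needles` is a plain list, each `in` test is O(n); converting it to a set first makes membership checks O(1) on average. A minimal sketch using the question's names (the `find_needles` helper is illustrative, not from the original):

```python
def find_needles(haystack, needles):
    """Walk the haystack in fixed 8-character strides and collect
    each chunk that appears in the needle set."""
    needle_set = set(needles)  # set membership is O(1) on average
    return [haystack[i:i+8] for i in range(0, len(haystack), 8)
            if haystack[i:i+8] in needle_set]

print(find_needles("abcdefghXXXXXXXXabcdefgh", ["abcdefgh"]))
# -> ['abcdefgh', 'abcdefgh']
```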
from functools import partial
from multiprocessing import Pool

text = "Your very long string"

def chunks(l, n):
    """Generator that chops a given sequence into chunks of length n."""
    for i in range(0, len(l), n):
        yield l[i:i+n]

def searchHaystack(needles, haystack):
    offset = 0
    found = []
    while offset * 8 < len(haystack):
        current_chunk = haystack[offset*8:offset*8+8]
        if current_chunk in needles:
            found.append(current_chunk)
        offset += 1
    return found  # return the matches, not the needle set

# Build a pool of 8 processes
pool = Pool(processes=8)
# Fragment the string data into 8 chunks
partitioned_text = list(chunks(text, len(text) // 8))
# Search each chunk in parallel; pool.map takes a single iterable,
# so bind the needles argument with functools.partial, then flatten
all_the_needles = [hit for hits in
                   pool.map(partial(searchHaystack, needles), partitioned_text)
                   for hit in hits]
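Two usage notes on the parallel approach: multiprocessing code needs an `if __name__ == "__main__":` guard on platforms that spawn rather than fork, and the haystack must be partitioned on an 8-character boundary so no chunk is split across workers (this is safe here only because the scan moves in fixed 8-character strides). A minimal self-contained sketch; the names and test data are illustrative:

```python
from functools import partial
from multiprocessing import Pool

CHUNK = 8  # needle length from the question

def search_part(needles, part):
    """Scan one partition in fixed 8-character strides."""
    return [part[i:i+CHUNK] for i in range(0, len(part), CHUNK)
            if part[i:i+CHUNK] in needles]

if __name__ == "__main__":
    text = "abcdefgh" * 4 + "XXXXXXXX"
    needles = {"abcdefgh"}
    # Partition on an 8-character boundary so no chunk is split.
    parts = [text[:16], text[16:]]
    with Pool(processes=2) as pool:
        hits = pool.map(partial(search_part, needles), parts)
    # Flatten the per-partition result lists
    found = [h for part_hits in hits for h in part_hits]
    print(found)  # -> ['abcdefgh', 'abcdefgh', 'abcdefgh', 'abcdefgh']
```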