
Search haystack for several equal length needles (Python)

I am looking for a way to search a large string for a large number of equal length substrings.

My current method is basically this:

offset = 0
found = []

# Scan the haystack in aligned 8-character chunks.
while offset * 8 < len(haystack):
    current_chunk = haystack[offset * 8:offset * 8 + 8]
    if current_chunk in needles:
        found.append(current_chunk)
    offset += 1

This is painfully slow. Is there a better Python way of doing this?

More Pythonic, much faster:

found = []
for needle in needles:
    if needle in haystack:
        found.append(needle)
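
If the needles really are all the same length, the question's aligned chunk scan can also be made much faster on its own by keeping the needles in a set, since an average membership test against a set is O(1) rather than O(n) for a list. A minimal sketch of that variant, reusing the haystack and needles names from the question:

needle_set = set(needles)  # set membership is O(1) on average
# Same aligned 8-character scan as the question, as a comprehension.
found = [haystack[i:i + 8]
         for i in range(0, len(haystack), 8)
         if haystack[i:i + 8] in needle_set]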

Edit: with some limited testing, here are the results:

This algorithm: 0.000135183334351

Your algorithm: 0.984048128128

Much faster.
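
For what it's worth, timings like these can be reproduced with the standard timeit module. A sketch under assumed data sizes (an 80,000-character haystack and 1,000 random 8-character needles, both invented here for illustration):

import random
import string
import timeit

# Hypothetical test data, sized only for illustration.
haystack = ''.join(random.choice(string.ascii_lowercase) for _ in range(80000))
needles = [''.join(random.choice(string.ascii_lowercase) for _ in range(8))
           for _ in range(1000)]

def needle_loop():
    return [needle for needle in needles if needle in haystack]

print(timeit.timeit(needle_loop, number=10))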

I think that you can break it up across multiple cores and parallelize your search. Something along the lines of:

from functools import partial
from multiprocessing import Pool

text = "Your very long string"
needles = {"chunk001", "chunk002"}  # your set of 8-character substrings

def chunks(l, n):
    """Generator that chops a given sequence into pieces of length n."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def searchHaystack(haystack, needles):
    """Scan one text fragment in aligned 8-character chunks."""
    found = []
    for offset in range(0, len(haystack), 8):
        current_chunk = haystack[offset:offset + 8]
        if current_chunk in needles:
            found.append(current_chunk)
    return found  # return the matches, not the needles

if __name__ == '__main__':
    # Build a pool of 8 processes
    with Pool(processes=8) as pool:
        # Fragment the string data into 8 chunks
        partitioned_text = list(chunks(text, len(text) // 8))

        # Generate all the needles found; pool.map takes one iterable,
        # so bind the needles argument with functools.partial
        results = pool.map(partial(searchHaystack, needles=needles),
                           partitioned_text)

    # Flatten the per-fragment lists into a single result list
    all_the_needles = [needle for found in results for needle in found]
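
One caveat with this sketch: the aligned scan inside searchHaystack only lines up with the original string if each fragment's length is a multiple of 8, so len(text) // 8 should be rounded down to a multiple of 8 before chunking. A hypothetical helper (not part of the original answer) could be:

def aligned_fragment_length(total_len, workers, chunk_size=8):
    """Largest fragment length near total_len // workers that is a
    multiple of chunk_size, so every fragment stays chunk-aligned."""
    approx = total_len // workers
    return max(chunk_size, approx - approx % chunk_size)

partitioned_text = list(chunks(text, aligned_fragment_length(len(text), 8)))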
