Python, Huge Iteration Performance Problem

I'm iterating through 3 words, each about 5 million characters long, and I want to find the sequences of 20 characters that identify each word. That is, I want to find all sequences of length 20 in one word that are unique to that word. My problem is that the code I've written takes an extremely long time to run. I've never even completed one word running my program overnight.

The function below takes a list of dictionaries, where each dictionary contains every possible 20-character sequence and its location within one of the 5-million-character words.

If anybody has an idea how to optimize this I would be really thankful; I don't have a clue how to continue...

Here's a sample of my code:

def findUnique(list):
    # Takes a list of dictionaries and compares each element in the dictionaries
    # with the others, puts all unique elements in new dictionaries, and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList=[]
    listlength=len(list)
    s=0
    valuelist=[]
    for i in list:
        j=i.values()
        valuelist.append(j)
    while s<listlength:
        currdic=list[s]
        dic={}
        for key in currdic:
            currval=currdic[key]
            test=True
            n=0
            while n<listlength:
                if n!=s:
                    if currval in valuelist[n]: # this is where it takes too much time
                        n=listlength
                        test=False
                    else:
                        n+=1
                else:
                    n+=1
            if test:
                dic[key]=currval
        dicList.append(dic)
        s+=1
    return dicList

def slices(seq, length, prefer_last=False):
  unique = {}
  if prefer_last: # this doesn't have to be a parameter, just choose one
    for start in xrange(len(seq) - length + 1):
      unique[seq[start:start+length]] = start
  else: # prefer first
    for start in xrange(len(seq) - length, -1, -1):
      unique[seq[start:start+length]] = start
  return unique

# or find all locations for each slice:
import collections
def slices(seq, length):
  unique = collections.defaultdict(list)
  for start in xrange(len(seq) - length + 1):
    unique[seq[start:start+length]].append(start)
  return unique

This function (currently in my iter_util module) is O(n) (n being the length of each word), and you would use set(slices(..)) (with set operations such as difference) to get slices that are unique across all words (example below). You could also write the function to return a set, if you don't want to track locations. Memory usage will be high (though still O(n), just with a large constant factor), possibly mitigated (though not by much if the length is only 20) with a special "lazy slice" class that stores the base sequence (the string) plus start and stop (or start and length).
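For reference, a minimal sketch of such a "lazy slice" class (LazySlice and its details are my own illustration, not tested code, and it assumes you only compare it against other lazy slices):

class LazySlice(object):
  # Store the base sequence plus offsets instead of copying the substring;
  # the real string is only built when the slice is hashed or compared.
  __slots__ = ("seq", "start", "stop")
  def __init__(self, seq, start, stop):
    self.seq, self.start, self.stop = seq, start, stop
  def materialize(self):
    return self.seq[self.start:self.stop]
  def __hash__(self):
    return hash(self.materialize())
  def __eq__(self, other):
    return self.materialize() == other.materialize()

As noted above, with slices only 20 characters long the per-object overhead means the savings are modest.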

Printing unique slices:

a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}

Including locations:

a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
               # (depending on which slices function used)

In a test script closer to your conditions, using randomly generated words of 5 million characters and a slice length of 20, memory usage was so high that my test script quickly hit my 1G main-memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. Reducing either the slice length or the word length (since I used completely random words, which reduces duplicates and increases memory use) so that it fit within main memory, it ran in under a minute. This situation plus the O(n**2) in your original code will take forever, and is why algorithmic time and space complexity are both important.

import operator
import random
import string

def slices(seq, length):
  unique = {}
  for start in xrange(len(seq) - length, -1, -1):
    unique[seq[start:start+length]] = start
  return unique

def sample_with_repeat(population, length, choice=random.choice):
  return "".join(choice(population) for _ in xrange(length))

word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]

You say you have a "word" 5 million characters long, but I find it hard to believe this is a word in the usual sense.

If you can provide more information about your input data, a specific solution might be available.

For example, English text (or any other written language) might be sufficiently repetitive that a trie would be usable. In the worst case, however, it would run out of memory constructing all 256^20 keys. Knowing your inputs makes all the difference.


edit

I took a look at some genome data to see how this idea stacked up, using a hardcoded [acgt]->[0123] mapping and 4 children per trie node.

  1. adenovirus 2: 35,937 bp -> 35,899 distinct 20-base sequences using 469,339 trie nodes

  2. enterobacteria phage lambda: 48,502 bp -> 40,921 distinct 20-base sequences using 529,384 trie nodes.

I didn't get any collisions, either within or between the two data sets, although maybe there is more redundancy and/or overlap in your data. You'd have to try it to see.
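As a rough illustration only (this is not the code that produced the numbers above), a trie of this kind might be built along these lines; BASES and count_distinct are names I'm inventing here, and the input is assumed to be lowercase acgt:

BASES = {'a': 0, 'c': 1, 'g': 2, 't': 3}  # hardcoded [acgt]->[0123] mapping

def count_distinct(seq, length=20):
  # Each trie node is a list of 4 child slots, one per base.
  root = [None, None, None, None]
  nodes = 1
  distinct = 0
  for start in xrange(len(seq) - length + 1):
    node = root
    created = False
    for ch in seq[start:start+length]:
      i = BASES[ch]
      if node[i] is None:
        node[i] = [None, None, None, None]
        nodes += 1
        created = True
      node = node[i]
    if created:
      # at least one new node means this 20-base sequence wasn't seen before
      distinct += 1
  return distinct, nodes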

If you do get a useful number of collisions, you could try walking the three inputs together, building a single trie, recording the origin of each leaf and pruning collisions from the trie as you go.

If you can't find some way to prune the keys, you could try using a more compact representation. For example, you only need 2 bits to store [acgt]/[0123], which might save you space at the cost of slightly more complex code.

I don't think you can just brute force this though - you need to find some way to reduce the scale of the problem, and that depends on your domain knowledge.

Let me build off Roger Pate's answer. If memory is an issue, I'd suggest that instead of using the strings as the keys to the dictionary, you could use a hashed value of the string. This would save the cost of storing the extra copy of the strings as the keys (at worst, 20 times the storage of an individual "word").

import collections
def hashed_slices(seq, length, hasher=None):
  unique = collections.defaultdict(list)
  for start in xrange(len(seq) - length + 1):
    unique[hasher(seq[start:start+length])].append(start)
  return unique

(If you really want to get fancy, you can use a rolling hash, though you'll need to change the function.)
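For illustration, a polynomial rolling-hash variant might look like the sketch below; the base and modulus are arbitrary choices of mine, the function name is made up, and collisions are still possible:

import collections

def rolling_hashed_slices(seq, length, base=257, mod=(1 << 61) - 1):
  # Updates the hash in O(1) per position instead of rehashing each slice.
  unique = collections.defaultdict(list)
  if len(seq) < length:
    return unique
  codes = [ord(c) for c in seq]
  top = pow(base, length - 1, mod)  # weight of the character that falls out
  h = 0
  for c in codes[:length]:
    h = (h * base + c) % mod
  unique[h].append(0)
  for start in xrange(1, len(seq) - length + 1):
    h = (h - codes[start - 1] * top) % mod            # remove the leftmost character
    h = (h * base + codes[start + length - 1]) % mod  # append the new character
    unique[h].append(start)
  return unique

It returns the same hash -> list-of-starts mapping as hashed_slices, so it could be dropped into the combining step below.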

Now, we can combine all the hashes:

unique = []  # Unique words in first string

# create a dictionary of hash values -> word index -> start position
hashed_starts = [hashed_slices(word, 20, hashing_fcn) for word in words]
all_hashed = collections.defaultdict(dict)
for i, hashed in enumerate(hashed_starts) :
   for h, starts in hashed.iteritems() :
     # We only care about the first word
     if h in hashed_starts[0] :
       all_hashed[h][i]=starts

# Now check all hashes
for starts_by_word in all_hashed.itervalues() :
  if len(starts_by_word) == 1 :
    # if there's only one word for the hash, it's obviously valid
    # (that word must be word 0, since only hashes present in word 0 are kept)
    unique.extend(words[0][i:i+20] for i in starts_by_word[0])
  else :
    # we might have a hash collision
    candidates = {}
    for word_idx, starts in starts_by_word.iteritems() :
      candidates[word_idx] = set(words[word_idx][j:j+20] for j in starts)
    # Now that we have the candidate slices, find the unique ones
    valid = candidates[0]
    for word_idx, candidate_set in candidates.iteritems() :
      if word_idx != 0 :
        valid -= candidate_set
    unique.extend(valid)

(I tried extending it to do all three. It's possible, but the complications would detract from the algorithm.)

Be warned, I haven't tested this. Also, there's probably a lot you can do to simplify the code, but the algorithm makes sense. The hard part is choosing the hash. Too many collisions and you won't gain anything. Too few and you'll hit the memory problems. If you are dealing with just DNA base codes, you can hash the 20-character string to a 40-bit number and still have no collisions. So the slices will take up nearly a fourth of the memory. That would save roughly 250 MB of memory relative to Roger Pate's answer.
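For DNA input, one such collision-free "hash" is simply the 2-bit packing; a sketch (dna_hash is my own name, not from the answer above):

DNA_CODE = {'a': 0, 'c': 1, 'g': 2, 't': 3}

def dna_hash(s):
  # Packs a 20-base slice into a 40-bit integer (2 bits per base), so equal
  # slices always map to equal values and there are no collisions at all.
  h = 0
  for ch in s:
    h = (h << 2) | DNA_CODE[ch]
  return h

# e.g. hashed_slices(word, 20, hasher=dna_hash)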

The code is still O(N^2), but the constant should be much lower.

Let's attempt to improve on Roger Pate's excellent answer.

Firstly, let's keep sets instead of dictionaries - they manage uniqueness anyway.

Secondly, since we are likely to run out of memory faster than we run out of CPU time (and patience), we can sacrifice CPU efficiency for the sake of memory efficiency. So perhaps try only the 20-character sequences that start with one particular letter. For DNA, this cuts the requirements down by 75%.

seqlen = 20
maxlength = max(len(word) for word in words)
for startletter in letters:          # e.g. letters = "acgt" for DNA
    seqtrie = {}                     # sequence -> list of word ids, for this pass only
    for letterid in range(maxlength - seqlen + 1):
        for wordid, word in enumerate(words):
            if letterid + seqlen <= len(word) and word[letterid] == startletter:
                seq = word[letterid:letterid+seqlen]
                wordids = seqtrie.setdefault(seq, [])
                if wordid not in wordids:
                    wordids.append(wordid)
    # examine seqtrie here (sequences whose id list has length 1 are unique to
    # that word), then discard it before the next starting letter

Or, if that's still too much memory, we can do a pass for each possible starting pair (16 passes instead of 4 for DNA), or for every starting triple (64 passes), and so on.
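A hedged sketch of that multi-pass idea (unique_for_prefix is my own name, and "acgt" is assumed as the alphabet): each pass only collects the sequences starting with one particular prefix, and you union the per-pass results at the end.

import itertools

def unique_for_prefix(words, prefix, seqlen=20):
    # Collect, per word, only the 20-character sequences starting with `prefix`,
    # then take the usual set differences to keep the unique ones.
    slice_sets = []
    for word in words:
        s = set()
        start = word.find(prefix)
        while start != -1 and start + seqlen <= len(word):
            s.add(word[start:start+seqlen])
            start = word.find(prefix, start + 1)
        slice_sets.append(s)
    return [s.difference(*(t for t in slice_sets if t is not s))
            for s in slice_sets]

# 16 passes over all two-letter DNA prefixes:
# for prefix in ("".join(p) for p in itertools.product("acgt", repeat=2)):
#     per_pass = unique_for_prefix(words, prefix)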
