
Working with suffix trees in Python

I'm relatively new to Python and am starting to work with suffix trees. I can build them, but I run into a memory issue when the string gets large. I know that they can be used to work with DNA strings of size 4^10 or 4^12, but whenever I try to implement one, I end up with a memory problem.

Here is my code for generating the string and the suffix tree:

import random

def get_string(length):
    return "".join(random.choice("ATGC") for _ in range(length))

word=get_string(4**4)+"$"

def suffixtree(string):
    for i in xrange(len(string)):
        if tree.has_key(string[i]):
            tree[string[i]].append([string[i+1:]][0])
        else:
            tree[string[i]]=[string[i+1:]]
    return tree

tree={}
suffixtree(word)

When I get up to around 4**8, I run into severe memory problems. I'm rather new to this, so I'm sure I'm missing something about how these should be stored. Any advice would be greatly appreciated.

As a note: I want to do string searching to look for matching strings in a very large string. The search-string match size is 16, so this looks for a string of size 16 within the large string, then moves on to the next string and performs another search. Since I'll be doing a very large number of searches, a suffix tree was suggested.

Many thanks

This doesn't look like a tree to me. It looks like you are generating all possible suffixes and storing them in a hashtable.

You will likely use much less memory with an actual tree. I suggest using a library implementation.

As others have said already, the data structure you are building is not a suffix tree. However, the memory issues stem largely from the fact that your data structure involves a lot of explicit string copies. A call like

string[i+1:]

creates an actual (deep) copy of the substring starting at i+1.
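To see the cost, here is a small Python 3 demonstration (in Python 3, memoryview is the closest analogue of the Python 2 buffer type): slicing a 1 MB byte string copies all of its data, while a view object stays tiny no matter how long the underlying string is.

```python
import sys

# Slicing copies: the slice owns its own ~1 MB of character data.
data = b"A" * 1_000_000
sliced = data[1:]

# A memoryview (Python 3's analogue of Python 2's buffer) shares the
# original bytes instead of copying them.
view = memoryview(data)[1:]

print(sys.getsizeof(sliced))  # roughly a megabyte
print(sys.getsizeof(view))    # a small fixed-size object
```

Building a suffix list this way therefore costs only a small constant per suffix instead of the length of the suffix.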

If you are still interested in constructing your original data structure (whatever its use may be), a good solution is to use buffers instead of string copies. Your algorithm would then look like this:

def suffixtree(string):
    N = len(string)
    for i in xrange(N):
        # buffer() gives a read-only view into string -- nothing is copied
        if tree.has_key(string[i]):
            tree[string[i]].append(buffer(string, i+1, N))
        else:
            tree[string[i]] = [buffer(string, i+1, N)]
    return tree

I tried this embedded in the rest of your code and confirmed that it requires significantly less than 1 GB of main memory, even at a total length of 8^11 characters.

Note that this will likely remain relevant even if you switch to an actual suffix tree. A correct suffix tree implementation will not store copies (not even buffers) on the tree edges; however, during tree construction you might need a lot of temporary copies of the strings. Using the buffer type for these is a very good way to avoid putting a heavy burden on the garbage collector for all the unnecessary explicit string copies.
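For readers on Python 3, where the buffer type no longer exists, the same idea can be sketched with memoryview. This is an illustrative port, not the original poster's code; memoryview needs a bytes-like object, so the string is encoded once up front.

```python
def suffix_table(string):
    # Encode once; memoryview works on bytes-like objects, not str.
    data = string.encode()
    view = memoryview(data)
    table = {}
    for i in range(len(data)):
        # Store a view of the remaining suffix -- no characters are copied.
        table.setdefault(string[i], []).append(view[i + 1:])
    return table

table = suffix_table("banana$")
```

Each stored entry can be materialized on demand with bytes(entry), but until then it only references the one shared buffer.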

If your memory problems lie in creating the suffix tree, are you sure you need one? You could find all matches in a single string like this:

word=get_string(4**12)+"$"

def matcher(word, match_string):
    positions = [-1]
    while 1:
        positions.append(word.find(match_string, positions[-1] + 1))
        if positions[-1] == -1:
            return positions[1:-1]

print matcher(word,'AAAAAAAAAAAA')
[13331731, 13331732, 13331733]
print matcher('AACTATAAATTTACCA','AT')
[4, 8]

My machine is pretty old, and this took 30 seconds to run on a 4^12 string. I used a 12-character target so there would be some matches. This solution will also find overlapping results, should there be any.
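For reference, here is a Python 3 rendering of the same scan; restarting the search one position past each hit is what makes it report overlapping matches.

```python
def matcher(word, match_string):
    positions = []
    pos = word.find(match_string)
    while pos != -1:
        positions.append(pos)
        # Resume just past the last hit so overlapping matches are found.
        pos = word.find(match_string, pos + 1)
    return positions

print(matcher('AACTATAAATTTACCA', 'AT'))  # [4, 8]
print(matcher('AAAA', 'AA'))              # [0, 1, 2] -- overlaps included
```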

Here is a suffix tree module you could try, used like this:

import suffixtree
stree = suffixtree.SuffixTree(word)
print stree.find_substring("AAAAAAAAAAAA")

Unfortunately, my machine is too slow to test this out properly with long strings. But presumably once the suffix tree is built, the searches will be very fast, so for a large number of searches it should be a good call. Further, find_substring only returns the first match (I don't know if this is an issue; I'm sure you could adapt it easily).

Update: Split the string into smaller suffix trees to avoid the memory problems.

So if you need to do 10 million searches on a 4^12-length string, we clearly do not want to wait 9.5 years (the standard simple search I first suggested, on my slow machine...). However, we can still use suffix trees (and thus be a lot quicker), AND avoid the memory issues. Split the large string into manageable chunks (which we know the machine's memory can cope with), turn each chunk into a suffix tree, search it 10 million times, then discard that chunk and move on to the next one. We also need to remember to search the overlap between each pair of adjacent chunks. I wrote some code to do this (it assumes the large string to be searched, word, is a multiple of our maximum manageable string length, max_length; you'll have to adjust the code to also check the remainder at the end if this is not the case):

def split_find(word, search_words, max_length):
    number_sub_trees = len(word) / max_length
    matches = {}
    for i in xrange(number_sub_trees):
        stree = suffixtree.SuffixTree(word[max_length*i:max_length*(i+1)])
        for search in search_words:
            if search not in matches:
                match = stree.find_substring(search)
                if match > -1:
                    matches[search] = match + max_length*i, i
                # also check the boundary region straddling this chunk
                # and the next (skip after the final chunk)
                if i < number_sub_trees - 1:
                    start = max_length*(i+1) - len(search)
                    match = word[start:max_length*(i+1) + len(search)].find(search)
                    if match > -1:
                        matches[search] = match + start, i
    return matches

word = get_string(4**12)
search_words = ['AAAAAAAAAAAAAAAA']  # list of all words to find matches for
max_length = 4**10  # as large as your machine can cope with (len(word) must be a multiple of this)
print split_find(word, search_words, max_length)

In this example I limit the maximum suffix tree length to 4^10, which needs about 700 MB. Using this code, for one 4^12-length string, 10 million searches should take around 13 hours (full searches with zero matches, so if there are matches it will be quicker). However, as part of this we need to build 100 suffix trees, which will take around 100 × 41 s ≈ 1 hour.

So the total running time is around 14 hours, with no memory issues... a big improvement on 9.5 years. Note that I am running this on a 1.6 GHz CPU with 1 GB RAM, so you ought to be able to do much better than this!

The reason you get memory problems is that for the input 'banana' you are generating {'b': ['anana$'], 'a': ['nana$', 'na$', '$'], 'n': ['ana$', 'a$']}. That isn't a tree structure. You have every possible suffix of the input created and stored in one of the lists, which takes O(n^2) storage space. Also, for a suffix tree to work properly, you want the leaf nodes to give you index positions.
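The O(n^2) figure is easy to check: storing every suffix explicitly costs n + (n-1) + ... + 1 = n(n+1)/2 characters, which at the 4**8 length from the question is already over two billion characters, roughly 2 GB even at one byte each.

```python
def stored_chars(n):
    # Total characters across all explicitly stored suffixes of an
    # n-character string: n + (n-1) + ... + 1.
    return n * (n + 1) // 2

print(stored_chars(4**8))  # over two billion characters
```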

The result you want to get is {'banana$': 0, 'a': {'$': 5, 'na': {'$': 3, 'na$': 1}}, 'na': {'$': 4, 'na$': 2}}. (This is an optimized representation; a simpler approach limits us to single-character labels.)
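The simpler single-character-label variant can be sketched in a few lines of Python 3, with each leaf recording where its suffix starts (stored here under the key "#", a convention chosen for this sketch):

```python
def build_suffix_trie(text):
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["#"] = i  # leaf marker: starting index of this suffix
    return root

trie = build_suffix_trie("banana$")
```

Unlike the hashtable-of-suffixes, common prefixes share nodes here, and a lookup walks one character per step. A real suffix tree would additionally compress unary chains into labeled edges and be built in O(n) with Ukkonen's algorithm.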
