
Optimizing string search in Python

I have to write a Python program that, given a large 50 MB DNA sequence and a smaller one of around 15 characters, returns a list of all 15-character sequences ordered by how close they are to the given one, along with where they occur in the larger sequence.

My current approach is to first get all the subsequences:

def get_subsequences_of_size(size, data):
    sequences = {}
    i = 0
    while(i+size <= len(data)):
        sequence = data[i:i+size]
        if sequence not in sequences:
            sequences[sequence] = data.count(sequence)
        i += 1
    return sequences

and then pack them in a list of dictionaries according to what the problem asked (I forgot to get the position):

def find_similar_sequences(seq, data):
    similar_sequences = {}
    sequences = get_subsequences_of_size(len(seq), data)
    for sequence in sequences.keys():
        diffs, muts = calculate_similarity(seq,sequence)
        if diffs not in similar_sequences:
            similar_sequences[diffs] = [{"Sequence": sequence, "Mutations": muts}]
        else:
            similar_sequences[diffs].append({"Sequence": sequence, "Mutations": muts})
        #similar_sequences[sequence] = {"Similarity": (len(sequence)-diffs), "Differences": diffs, "Mutatations": muts}
    return similar_sequences
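`calculate_similarity` is not shown here. As a minimal sketch only, assuming it returns the number of mismatching positions together with a list of the mismatches (the tuple layout `(index, expected, found)` is an assumption made for illustration), it could look something like this:

```python
def calculate_similarity(seq, candidate):
    """Hypothetical stand-in: count mismatches between two equal-length strings.

    Returns (number_of_differences, list_of_mismatches), where each mismatch
    is recorded as (index, base_in_seq, base_in_candidate).
    """
    muts = [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq, candidate))
        if a != b
    ]
    return len(muts), muts


diffs, muts = calculate_similarity("ACGT", "ACCT")
print(diffs, muts)
```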

My problem is that this runs way too slowly. With the 50 MB input, it takes over 30 minutes to finish processing.
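A likely contributor to the running time is the call to `data.count(sequence)`, which rescans the entire input once for every distinct window, so the work grows roughly with the number of distinct windows times the input length. The following is a minimal sketch of counting all windows in a single pass instead. The name `count_subsequences` is hypothetical, and note one behavioral difference: this version counts overlapping occurrences, whereas `str.count` only counts non-overlapping ones.

```python
from collections import Counter


def count_subsequences(size, data):
    """Count every window of length `size` in one pass over `data`."""
    # Each start index contributes exactly one window, so overlapping
    # occurrences are counted (str.count would skip overlaps).
    return Counter(data[i:i + size] for i in range(len(data) - size + 1))


counts = count_subsequences(3, "AAAAC")
print(counts["AAA"], counts["AAC"])   # overlapping "AAA" windows are both counted
print("AAAAC".count("AAA"))           # str.count reports non-overlapping matches only
```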

What about the following approach:

Go with a sliding window of length 15 over your long sequence and, for every subsequence:

  • store the start location on the long sequence
  • calculate and store the similarity
import re
from itertools import islice
from collections import defaultdict


def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    # from https://docs.python.org/release/2.3.5/lib/itertools-example.html
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield ''.join(result)
    for elem in it:
        result = result[1:] + (elem,)
        yield ''.join(result)

def hamming_distance(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

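# long_seq and short_seq are assumed to already hold the two DNA strings
k = len(short_seq)
locations = defaultdict(list)
similarities = defaultdict(set)

for start, subseq in enumerate(window(long_seq, k)):
    similarity = hamming_distance(subseq, short_seq) # substitute with your own similarity function
    locations[subseq].append(start)       # store the start location on the long sequence
    similarities[similarity].add(subseq)  # group the subsequence under its distance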

with open(r'stack46268997.txt', 'w') as f:
    for similarity in sorted(similarities.keys()):
        f.write("Sequence(s) which differ in {} base(s) from the short sequence:\n".format(similarity))
        for subseq in similarities[similarity]:
            f.write("{} at location(s) {}\n".format(subseq, ', '.join(map(str, locations[subseq]))))

This outputs the list of subsequences ordered by how close they are to the given sequence.

Sequence(s) which differ in 0 base(s) from the short sequence:
TGGCGACGGACTTCA at location(s) 300, 500

Sequence(s) which differ in 5 base(s) from the short sequence:
TGGCGATCGCCGTCG at location(s) 362

Sequence(s) which differ in 6 base(s) from the short sequence:
TGGCAACTACCTGAA at location(s) 86
TGGTGAGTATTTTCA at location(s) 401
TGGCGAGGGGGATGC at location(s) 191

Sequence(s) which differ in 7 base(s) from the short sequence:
ATGTGAAGGATGTGA at location(s) 283
AGGGGGATGCCTTCT at location(s) 196
TGACAACAACGTTTA at location(s) 53
CGCTGACGGATTATG at location(s) 154
TTATGACCGTTTTCC at location(s) 164
TGGTTGCTGGTTTCC at location(s) 430
TCGCGTCAGCCCGGA at location(s) 8
AGTCGCCTGAGTCCG at location(s) 30, 536
CGGCGATGTGGTTGC at location(s) 422

[... and so on...]
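To check the idea on a small input, the same steps can be condensed into one function. This is only a sketch: `rank_windows` is a hypothetical name, it uses plain string slicing instead of the `window` generator, and it computes the Hamming distance once per distinct window rather than once per position.

```python
from collections import defaultdict


def rank_windows(long_seq, short_seq):
    """Return (distance, window, start_positions) tuples, closest first."""
    k = len(short_seq)
    locations = defaultdict(list)
    # Record every start position of every window of length k.
    for start in range(len(long_seq) - k + 1):
        locations[long_seq[start:start + k]].append(start)
    ranked = []
    for subseq, starts in locations.items():
        # Hamming distance: number of positions where the bases differ.
        distance = sum(a != b for a, b in zip(subseq, short_seq))
        ranked.append((distance, subseq, starts))
    ranked.sort()
    return ranked


for distance, subseq, starts in rank_windows("ACGTACGA", "ACG"):
    print(distance, subseq, starts)
```

On this toy input, the exact match `ACG` comes first with start positions 0 and 4, followed by the four windows that differ in all three bases.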

I also ran the script on a 50 MB FASTA file. On my machine, this took 42 seconds to compute the results and another 30 seconds to write them out to a file (printing them out would have taken much longer!).
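How the FASTA file was loaded is not shown. As a rough sketch, assuming a plain-text FASTA file whose header lines start with `>` and whose records can simply be concatenated into one string, the long sequence could be read like this (`read_fasta_sequence` and the file name in the comment are hypothetical):

```python
def read_fasta_sequence(path):
    """Read a FASTA file and return its sequence lines joined into one string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and header lines, which begin with '>'.
            if line and not line.startswith(">"):
                parts.append(line.upper())
    return "".join(parts)


# Example call with a hypothetical file name:
# long_seq = read_fasta_sequence("genome.fa")
```

If the file holds several records, joining them creates windows that span record boundaries, so it may be preferable to process each record separately.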
