What's the most efficient way to load k-mers into a dict in Python?

Currently I have several hundred bacterial genomes (FASTA files). I want to parse these files and load the k-mers they contain into a single dict. For example:

A.fasta

>1 
ATAATA 
>2 
TTTAAA
.....

B.fasta

>1 
ATAAGA 
>2 
TTTAGA 
......

The dict I want would then look like this (assuming k=4):

d={'ATAA':{'A':'','B':'',...},'TAAA':{'A':'',...},...} # A refers to "A.fasta", B refers to "B.fasta"
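
For concreteness, the target structure for the toy input above can be reproduced with a short sketch. It assumes the sequences are already in memory as plain strings (the `genomes` mapping and the `toy_kmer_dict` name are made up for illustration), ignores reverse complements, and uses the same empty-string placeholder values as the dict shown above.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for A.fasta and B.fasta (label -> sequences).
genomes = {
    "A": ["ATAATA", "TTTAAA"],
    "B": ["ATAAGA", "TTTAGA"],
}

def toy_kmer_dict(genomes, k):
    """Map each k-mer to a dict keyed by the labels of the files containing it."""
    d = defaultdict(dict)
    for label, seqs in genomes.items():
        for seq in seqs:
            # Slide a window of width k over the sequence.
            for i in range(len(seq) - k + 1):
                d[seq[i:i + k]][label] = ""
    return dict(d)

d = toy_kmer_dict(genomes, 4)
print(d["ATAA"])  # {'A': '', 'B': ''}
print(d["TAAA"])  # {'A': ''}
```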

However, my own code (below) is not efficient enough for this. Is there a more efficient way to achieve it?

    import re 
    import os 
    from Bio import SeqIO 
    from collections import defaultdict 
    import sys
    sys.path.append('..')
    from library import seqpy 
    def build_kmer_dict(idir,k): 
        print('Load k-mer to dict...') 
        dlabel=defaultdict(lambda:{}) 
        c=1  
        label_match={} 
        for filename in os.listdir(idir): 
            ff=idir+'/'+filename 
            seq_dict = {rec.id : rec.seq for rec in SeqIO.parse(ff, "fasta")} 
            for cl in seq_dict: 
                seq=str(seq_dict[cl]) 
                for i in range(len(seq)-k+1): 
                    kmer=seq[i:i+k] 
                    rev_kmer=seqpy.revcomp(seq[i:i+k]) 
                    dlabel[kmer][c]='' 
                    dlabel[rev_kmer][c]='' 
            label_match[c]=filename
            c+=1 
        return dlabel,label_match 
    # All genomes fasta files are in the folder "../Fasta_File_Dir"
    d, lm=build_kmer_dict('../Fasta_File_Dir',31)

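Demo code based on the comments on the original post. It reverse-complements each whole sequence once, rather than reverse-complementing every overlapping fragment inside the loop.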

    import os
    import sys
    from collections import defaultdict

    from Bio import SeqIO

    sys.path.append('..')
    from library import seqpy  # the asker's local module providing revcomp()

    def build_kmer_dict(idir, k):
        print('Load k-mer to dict...')
        dlabel = defaultdict(dict)  # k-mer -> {genome id: ''}
        c = 1  # integer id of the current genome file
        label_match = {}  # genome id -> filename
        for filename in os.listdir(idir):
            ff = os.path.join(idir, filename)
            seq_dict = {rec.id: rec.seq for rec in SeqIO.parse(ff, "fasta")}
            for cl in seq_dict:
                seq = str(seq_dict[cl])
                rev_seq = seqpy.revcomp(seq)  # calculate the reverse complement once
                for i in range(len(seq) - k + 1):
                    kmer = seq[i:i + k]
                    # Replaces the per-k-mer call seqpy.revcomp(seq[i:i+k]);
                    # assumes that order doesn't matter.
                    rev_kmer = rev_seq[i:i + k]
                    dlabel[kmer][c] = ''
                    dlabel[rev_kmer][c] = ''
            label_match[c] = filename
            c += 1
        return dlabel, label_match

    # All genome FASTA files are in the folder "../Fasta_File_Dir"
    d, lm = build_kmer_dict('../Fasta_File_Dir', 31)
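
The order-independence assumption in the code above should hold because the k-mers of the reverse-complemented sequence are the reverse complements of the original k-mers, only visited in the opposite order. Below is a quick check of that claim. Since the post does not show `seqpy.revcomp`, the `revcomp` here is a stand-in built on `str.translate`, and both it and the `kmers` helper are assumptions made for illustration.

```python
# Stand-in for seqpy.revcomp; assumes plain uppercase A/C/G/T/N input.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq, k = "ATAAGATTTAGA", 4
per_kmer = {revcomp(km) for km in kmers(seq, k)}  # one revcomp call per k-mer
whole_seq = set(kmers(revcomp(seq), k))           # one revcomp call per sequence
print(per_kmer == whole_seq)  # True
```

Because only set membership is recorded in the dict, the reversed visiting order has no effect on the result, while the number of `revcomp` calls drops from one per k-mer to one per sequence.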
