What's the most efficient way to load k-mers into a dict in Python?
Currently, I have hundreds of bacterial genomes (fasta files). I want to parse these fasta files and then load the k-mers from them into a single dict. For example,
A.fasta
>1
ATAATA
>2
TTTAAA
.....
B.fasta
>1
ATAAGA
>2
TTTAGA
......
Then the dict I want to get looks like this (assuming k=4):
d={'ATAA':{'A':'','B':'',...},'TAAA':{'A':'',...},...} # A refers to "A.fasta", B refers to"B.fasta"
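To make the target structure concrete, here is a minimal sketch that builds it for the two toy files above. The sequences are held in memory instead of being read from disk, and the names `genomes` and `build_toy_kmer_dict` are hypothetical, introduced only for illustration. Like the example dict, it ignores reverse complements.

```python
from collections import defaultdict

# Toy stand-ins for A.fasta and B.fasta (sequences copied from the example above).
genomes = {
    "A": ["ATAATA", "TTTAAA"],
    "B": ["ATAAGA", "TTTAGA"],
}

def build_toy_kmer_dict(genomes, k):
    """Map each k-mer to a dict whose keys are the labels of the genomes containing it."""
    d = defaultdict(dict)
    for label, seqs in genomes.items():
        for seq in seqs:
            # Slide a window of width k across the sequence.
            for i in range(len(seq) - k + 1):
                d[seq[i:i + k]][label] = ''
    return d

d = build_toy_kmer_dict(genomes, 4)
print(d['ATAA'])  # {'A': '', 'B': ''}
print(d['TAAA'])  # {'A': ''}
```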
However, I find that my own code (see below) is not efficient enough... Is there a more efficient way to achieve this?
import re
import os
from Bio import SeqIO
from collections import defaultdict
import sys
sys.path.append('..')
from library import seqpy

def build_kmer_dict(idir,k):
    print('Load k-mer to dict...')
    dlabel=defaultdict(lambda:{})
    c=1
    label_match={}
    for filename in os.listdir(idir):
        ff=idir+'/'+filename
        seq_dict = {rec.id : rec.seq for rec in SeqIO.parse(ff, "fasta")}
        for cl in seq_dict:
            seq=str(seq_dict[cl])
            for i in range(len(seq)-k+1):
                kmer=seq[i:i+k]
                rev_kmer=seqpy.revcomp(seq[i:i+k])
                dlabel[kmer][c]=''
                dlabel[rev_kmer][c]=''
        label_match[c]=filename
        c+=1
    return dlabel,label_match

# All genomes fasta files are in the folder "../Fasta_File_Dir"
d, lm=build_kmer_dict('../Fasta_File_Dir',31)
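Demo code based on the comments on the original post. It computes the reverse complement of the whole sequence once, instead of computing it for each overlapping slice inside the loop.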
import re
import os
from Bio import SeqIO
from collections import defaultdict
import sys
sys.path.append('..')
from library import seqpy

def build_kmer_dict(idir,k):
    print('Load k-mer to dict...')
    dlabel=defaultdict(lambda:{})
    c=1
    label_match={}
    for filename in os.listdir(idir):
        ff=idir+'/'+filename
        seq_dict = {rec.id : rec.seq for rec in SeqIO.parse(ff, "fasta")}
        for cl in seq_dict:
            seq=str(seq_dict[cl])
            rev_seq =seqpy.revcomp(seq) # calculate reverse sequence once
            for i in range(len(seq)-k+1):
                kmer=seq[i:i+k]
                #rev_kmer=seqpy.revcomp(seq[i:i+k])
                rev_kmer=rev_seq[i:i+k] # assumes that order doesn't matter
                dlabel[kmer][c]=''
                dlabel[rev_kmer][c]=''
        label_match[c]=filename
        c+=1
    return dlabel,label_match

# All genomes fasta files are in the folder "../Fasta_File_Dir"
d, lm=build_kmer_dict('../Fasta_File_Dir',31)
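
The "order doesn't matter" assumption can be checked directly: the k-mers sliced from the reverse complement of the whole sequence are the same set as the reverse complements of the individual k-mers, just visited in the opposite order. Below is a minimal sketch of that check. Since the post does not show `seqpy.revcomp`, the `revcomp` here is a hypothetical stand-in built on `str.translate`, and it assumes upper-case A/C/G/T input only.

```python
# Hypothetical stand-in for seqpy.revcomp; assumes upper-case A/C/G/T input.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

seq = "ATAAGATTTAGA"
k = 4

# One revcomp call per k-mer, as in the question's code.
per_kmer = {revcomp(kmer) for kmer in kmers(seq, k)}
# One revcomp call per sequence, as in the demo code above.
whole_seq = kmers(revcomp(seq), k)

print(per_kmer == whole_seq)  # True
```

Because both approaches insert the same keys into the dict, the result is unchanged while the number of reverse-complement calls drops from one per k-mer to one per sequence.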