Python/Biopython. Get enumerated list of sequences matching words after parsing file with protein sequences

In Python/Biopython, I am trying to get an enumerated list of protein sequences that match the string "Human adenovirus". The problem with the code below is that the enumeration counts every sequence that is parsed, not only those that pass the if filter.

EDITED CODE with proper syntax:

from Bio import SeqIO
import sys  
sys.stdout = open("out_file.txt","w")

for index, seq_record in enumerate(SeqIO.parse("in_file.txt", "fasta")):
    if "Human adenovirus" in seq_record.description:

        print "%i]" % index, str(seq_record.description) 
        print str(seq_record.seq) + "\n"

This is a piece of the input file:

>gi|927348286|gb|ALE15299.1| penton [Bottlenose dolphin adenovirus 1]
MQRPQQTPPPPYESVVEPLYVPSRYLAPSEGRNSIRYSQLPPLYD

>gi|15485528|emb|CAC67483.1| penton [Human adenovirus 2]
MQRAAMYEEGPPPSYESVVSAAPVAAALGSPFDAPLDPPFVPPRYLRPTGGRNSIRYSELAPLFDTTRVY
LVDNKSTDVASLNYQNDHSNFLTTVIQNNDY

>gi|1194445857|dbj|BAX56610.1| fiber, partial [Human mastadenovirus C]
FNPVYPYDTETGPPTVPFLTPPFVSPNG

The output file I get looks like this:

2] gi|15485528|emb|CAC67483.1| penton [Human adenovirus 2]
MQRAAMYEEGPPPSYESVVSAAPVAAALGSPFDAPLDPPFVPPRYLRPTGGRNSIRYSELAPLFDTTRVY
LVDNKSTDVASLNYQNDHSNFLTTVIQNNDY

I would like the first sequence that passes the filter to be numbered 1], not 2] as shown above. I know I need to somehow add a counter after the if statement, but I have tried many alternatives without getting the desired output. This should not be difficult; I know how to do it in Perl, but not in Python/Biopython.

The issue is that you only want to increment the index when the description contains "Human adenovirus", but you are enumerating every record.

If we modify your code sample to only increment the index when a match is found, we get this:

from Bio import SeqIO
index = 0
with open("out_file.txt","w") as f:
    for seq_record in SeqIO.parse("in_file.txt", "fasta"):
        if "Human adenovirus" in seq_record.description:
            index += 1
            print "%i]" % index, str(seq_record.description) 
            print str(seq_record.seq) + "\n"

Btw, why are you opening a file for writing, but never writing to it?
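
Picking up on that remark, here is a minimal Python 3 sketch of the same idea that actually writes through the file handle. It is only one possible way to do it, not the answerer's code: it filters the records first and then numbers the survivors with `enumerate(..., start=1)`. The helper names `number_matches` and `write_matches`, the `Record` namedtuple, and `demo_records` are hypothetical stand-ins introduced here so the example runs without Biopython; the only thing assumed about a record is that it has `.description` and `.seq` attributes, as Biopython's `SeqRecord` does.

```python
from collections import namedtuple


def number_matches(records, keyword):
    """Return (n, record) pairs for records whose description contains
    keyword, with n counting only the matches and starting at 1."""
    matches = (rec for rec in records if keyword in rec.description)
    return enumerate(matches, start=1)


def write_matches(records, keyword, out_path):
    """Write the numbered matches to out_path; return how many were written."""
    count = 0
    with open(out_path, "w") as handle:
        for count, rec in number_matches(records, keyword):
            handle.write("%i] %s\n" % (count, rec.description))
            handle.write("%s\n\n" % rec.seq)
    return count


# Hypothetical stand-in for SeqRecord objects, built from the sample input above.
Record = namedtuple("Record", ["description", "seq"])
demo_records = [
    Record("gi|927348286|gb|ALE15299.1| penton [Bottlenose dolphin adenovirus 1]",
           "MQRPQQTPPPPYESVVEPLYVPSRYLAPSEGRNSIRYSQLPPLYD"),
    Record("gi|15485528|emb|CAC67483.1| penton [Human adenovirus 2]",
           "MQRAAMYEEGPPPSYESVVSAAPVAAALGSPFDAPLDPPFVPPRYLRPTGGRNSIRYSELAPLFDTTRVY"
           "LVDNKSTDVASLNYQNDHSNFLTTVIQNNDY"),
    Record("gi|1194445857|dbj|BAX56610.1| fiber, partial [Human mastadenovirus C]",
           "FNPVYPYDTETGPPTVPFLTPPFVSPNG"),
]

for n, rec in number_matches(demo_records, "Human adenovirus"):
    print("%i] %s" % (n, rec.description))

# With Biopython installed, the real file would be processed like this:
#   from Bio import SeqIO
#   write_matches(SeqIO.parse("in_file.txt", "fasta"),
#                 "Human adenovirus", "out_file.txt")
```

Because the filter runs before `enumerate`, no separate counter variable has to be kept in step with the `if` test, and the `with` block closes the output file even if parsing fails partway through.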
