简体   繁体   English

使用python在制表符分隔的文件中找到一个字母数字条目

[英]find a alpha-numeric entry in tab delimited file with python

I'm trying to perform a GOs annotation using the SIMAP database which is Blast2GO annotated. 我正在尝试使用带有Blast2GO批注的SIMAP数据库执行GO批注。 Everything is fine, but I have problems when I try to find the accession number in the file where entry numbers are associated with their GOs. 一切都很好,但是当我尝试在文件中找到条目号与它们的GO关联的登录号时遇到了问题。 The problem is that the script does not find the number in the input file when really there is. 问题在于,当实际存在时,脚本无法在输入文件中找到编号。 I tried several things without good results (re.match, insert in a list and then extract the element, etc) File where the GOs are associated with entry numbers has this structure (accession number, GO term, blats2go score): 我尝试了几种方法,但均没有良好的结果(重新匹配,插入列表中然后提取元素等),其中GO与条目号相关联的文件具有以下结构(登录号,GO术语,blats2go得分):

1f0ba1d119f52ff28e907d2b5ea450db GO:0007154 79 1f0ba1d119f52ff28e907d2b5ea450db GO:0007154 79

1f0ba1d119f52ff28e907d2b5ea450db GO:0005605 99 1f0ba1d119f52ff28e907d2b5ea450db GO:0005605 99

The python code: python代码:

import re
from Bio.Blast import NCBIXML
from Bio import SeqIO

input_file = open('/home/fpiston/Desktop/test_go/test2.fasta', 'rU')
result_handle = open('/home/fpiston/Desktop/test_go/test2.xml', 'rU')
save_file = open('/home/fpiston/Desktop/test_go/test2.out', 'w')

fh = open('/home/fpiston/Desktop/test_go/Os_Bd_Ta_blat2go_fake', 'rU')
q_dict =  SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
blast_records = NCBIXML.parse(result_handle)

hits = []

for blast_record in blast_records:
    if blast_record.alignments:
        list = (blast_record.query).split()
        if re.match('ENA|\w*|\w*', list[0]) != None:
            list2 = list[0].split("|")
            save_file.write('%s\t' % list2[1])
        else:
            save_file.write('%s\t' % list[0])
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                h = alignment.hit_def    
                for l in fh:             
                    ls = l.split()       #at this point all right
                    if h in ls:          #here, 'h' in not found in 'fh'
                        print h
                        print 'ok'
                        save_file.write('%s\t' % ls[1])
                save_file.write('\n')
        hits.append(blast_record.query.split()[0])
misses =set(q_dict.keys()) - set(hits)

for i in misses:
    list = i.split("|")
    if len(list) > 1:
        save_file.write('%s\t' % list[1])
    else:
        save_file.write('%s\t' % list)
    save_file.write('%s\n' % 'no_match')

save_file.close() 

This is the code with the correction of martineau (fh.seek(0)): 这是对martineau (fh.seek(0))进行更正的代码:

#!/usr/bin/env python
import sys
import re
from Bio.Blast import NCBIXML
from Bio import SeqIO

input_file = sys.argv[1] #queries sequences in fasta format
out_blast_file = sys.argv[2] #name of the blast results file
output_file = sys.argv[3] #name of the output file

result_handle = open(out_blast_file, 'rU')
fh = open('/home/fpiston/Desktop/test_go/Os_Bd_Ta_blat2go', 'rU')
q_dict =  SeqIO.to_dict(SeqIO.parse(open(input_file), "fasta"))
blast_records = NCBIXML.parse(result_handle)
save_file = open(output_file, 'w')
hits = []

for blast_record in blast_records:
    if blast_record.alignments:
        list = (blast_record.query).split()
        if re.match('ENA|\w*|\w*', list[0]) != None:
            list2 = list[0].split("|")
            save_file.write('\n%s\t' % list2[1])
        else:
            save_file.write('\n%s\t' % list[0])
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                hit = alignment.hit_def
                save_file.write('%s\t' % hit)
                fh.seek(0)
                for l in fh:
                    ls = l.split()
                    if ls[0] in  hit:
                        save_file.write('%s\t' % ls[1])          
        hits.append(blast_record.query.split()[0])

misses =set(q_dict.keys()) - set(hits)

for i in misses:
    list = i.split("|")
    if len(list) > 1:
        save_file.write('\n%s\t' % list[1])
    else:
        save_file.write('\n%s\t' % list)
    save_file.write('%s' % 'no_match')

save_file.close() 

I really have no idea what you're talking about here, but noticed that within the outer for blast_record in blast_records: and for alignment in blast_record.alignments: loops you have a for l in fh: but never rewind the file with a fh.seek(0) anywhere, which means it only reads the lines in the file the first time it's executed -- which seems illogical. 我真的不知道您在这里说什么,但是注意到for blast_record in blast_records:for blast_record in blast_records: for alignment in blast_record.alignments:循环中,您for l in fh:有一个for l in fh:但是永远不要for l in fh:倒带该文件fh.seek(0)在任何地方,这意味着它仅在第一次执行时读取文件中的行,这似乎是不合逻辑的。

You could fix this by adding the fh.seek(0) just before the inner loop. 您可以通过在内循环之前添加fh.seek(0)来解决此问题。 Although unnecessary the very first time the inner loop executes, it's need all the following times and doing it one extra time won't hurt anything. 尽管内循环第一次执行是不必要的,但接下来的所有时间都是必需的,再执行一次不会对任何事情造成伤害。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM