
Manipulating a GFF file with Biopython

I have a GFF file, which is a tab-delimited, 9-column file. My GFF file looks like this:

chr1    GenBank region  1   2821361 .   +   1   ID=CP000253.1
chr1    S-MART  utr5    313 516     .   +   .   ID=CP000253.1|+313..516
chr1    GenBank gene    517 1878    .   +   1   ID=SAOUHSC_00001

... and so on.

Problem statement:

Now I want to merge consecutive rows that satisfy a condition: the 5th-column value (end) of row i should equal the 4th-column value (start) of row i+1 minus 1.

So the final result should look like this:

chr1    GenBank region  1   2821361 .   +   1   ID=CP000253.1
chr1    predict TU      313 1878    .   +   1   ID=SAOUHSC_00001
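
The merging rule itself can be sketched without Biopython. The following is only an illustration of the condition stated above, not the program from this post. It assumes that each row is a plain tuple of the nine GFF columns with integer start and end, and that a merged row takes the source "predict", the type "TU", and the remaining fields of the later row, as in the expected result. The function name merge_adjacent is hypothetical.

```python
def merge_adjacent(rows):
    """Merge consecutive rows where end of row i == start of row i+1 minus 1."""
    merged = []
    for row in rows:
        seqid, source, ftype, start, end, score, strand, phase, attrs = row
        # Compare against the last row kept so far (which may itself be a merge).
        if merged and merged[-1][0] == seqid and merged[-1][4] == start - 1:
            prev_start = merged[-1][3]
            # Assumption: the merged row keeps the later row's fields,
            # with source "predict" and type "TU" as in the expected output.
            merged[-1] = (seqid, "predict", "TU", prev_start, end,
                          score, strand, phase, attrs)
        else:
            merged.append(row)
    return merged


rows = [
    ("chr1", "GenBank", "region", 1, 2821361, ".", "+", "1", "ID=CP000253.1"),
    ("chr1", "S-MART", "utr5", 313, 516, ".", "+", ".", "ID=CP000253.1|+313..516"),
    ("chr1", "GenBank", "gene", 517, 1878, ".", "+", "1", "ID=SAOUHSC_00001"),
]

for row in merge_adjacent(rows):
    print("\t".join(str(field) for field in row))
```

Because each new row is compared with the last row already kept, a run of three or more adjacent rows collapses into a single row.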

To do this, I wrote the following program:

from BCBio import GFF
from Bio.SeqFeature import SeqFeature, FeatureLocation

in_file = "infile.gff"
out_file = "outfile.gff"

limit_info = dict(
        gff_type = ['CDS','exon','gene','mRNA','operon','rRNA','tRNA','utr3','utr5'])
new_qualifiers = {"source": "prediction","ID": "CP000253.1"}
new_sub_qualifiers = {"source": "prediction"}
new_top_feature = SeqFeature(FeatureLocation(0, 2821361), type="genomeRegion", strand=1,
                         qualifiers=new_qualifiers)
i=0

in_handle = open(in_file)
for rec in GFF.parse(in_handle, limit_info=limit_info):
    for i in range(10):
        if rec.features[i].location.end == rec.features[i+1].location.start :
            # print rec.features[i]
            new_top_feature.sub_features[i] = [SeqFeature(
                FeatureLocation(rec.features[i].location.start,
                                rec.features[i+1].location.end,
                                strand=rec.features[i].strand),
                type="Transcription_unit", qualifiers=new_sub_qualifiers)]

in_handle.close()

rec.features = [new_top_feature]

with open(out_file, "w") as out_handle:
    GFF.write([rec], out_handle)

I get the following error:

/usr/lib/python2.7/dist-packages/Bio/SeqFeature.py:171: BiopythonDeprecationWarning: Rather using f.sub_features, f.location should be a CompoundFeatureLocation
  BiopythonDeprecationWarning)
Traceback (most recent call last):
  File "/home/nkumar/workplacekepler/random/src/limit.py", line 26, in <module>
    new_top_feature.sub_features[i] = [SeqFeature(FeatureLocation(rec.features[i].location.start , rec.features[i+1].location.end ,strand=rec.features[i].strand), type="Transcription_unit",  qualifiers=new_sub_qualifiers)]
IndexError: list assignment index out of range

Even though it is an index out of range error, I am not able to figure out what is wrong.

in_handle = open(in_file)
for rec in GFF.parse(in_handle, limit_info=limit_info):
    for i in range(10):        
        if rec.features[i].location.end == rec.features[i+1].location.start :
            print 1          
        else:
            print rec.features[i]            
in_handle.close()

This one works perfectly and prints all the features.

You defined new_top_feature as:

type: genomeRegion
location: [0:2821361](+)
qualifiers: 
    Key: ID, Value: CP000253.1
    Key: source, Value: prediction

But it has no sub-features:

>>> print new_top_feature.sub_features
[]

new_top_feature.sub_features is thus an empty list. You cannot assign to an index of an empty list:

>>> a = []
>>> a[0] = 3
Traceback (most recent call last):
  File "<input>", line 1, in <module>
IndexError: list assignment index out of range

And this is what you are doing in:

new_top_feature.sub_features[i] =  .....

To add data to this list, you should use append instead of indexing. If you need to fill the list at given positions in arbitrary order, you could create a list of the required size filled with zeros and then assign the values to their positions as they come.
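
Both options can be sketched with plain lists. This is a minimal illustration: the string values stand in for the SeqFeature objects, the names sub_features and slots are placeholders, and None is used as the filler instead of zero so that empty positions are easy to tell apart from real values.

```python
# Option 1: grow the list with append (indexing an empty list fails).
sub_features = []
try:
    sub_features[0] = "TU_1"      # same mistake as in the question
except IndexError as err:
    print("IndexError:", err)

sub_features.append("TU_1")       # append works on an empty list
sub_features.append("TU_2")
print(sub_features)

# Option 2: pre-allocate a list of known size, then fill positions in any order.
slots = [None] * 3                # placeholder values; size assumed known
slots[2] = "TU_c"
slots[0] = "TU_a"
print(slots)
```

With append, the position of each element is simply the order in which it was added, so the loop index i is not needed for the assignment.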
