Manipulating a GFF file with Biopython
I have a GFF file, which is a tab-delimited, 9-column file. My GFF file looks like this:
chr1 GenBank region 1 2821361 . + 1 ID=CP000253.1
chr1 S-MART utr5 313 516 . + . ID=CP000253.1|+313..516
chr1 GenBank gene 517 1878 . + 1 ID=SAOUHSC_00001
... and so on.
Problem statement:
Now I want to merge the rows that satisfy a condition: the 5th column (end) of row i must equal the 4th column (start) of row i+1 minus 1.
So the final result should look like this:
chr1 GenBank region 1 2821361 . + 1 ID=CP000253.1
chr1 predict TU 313 1878 . + 1 ID=SAOUHSC_00001
To do this, I wrote the following program:
from BCBio import GFF
from Bio.SeqFeature import SeqFeature, FeatureLocation
in_file = "infile.gff"
out_file = "outfile.gff"
limit_info = dict(
    gff_type = ['CDS','exon','gene','mRNA','operon','rRNA','tRNA','utr3','utr5'])
new_qualifiers = {"source": "prediction","ID": "CP000253.1"}
new_sub_qualifiers = {"source": "prediction"}
new_top_feature = SeqFeature(FeatureLocation(0, 2821361), type="genomeRegion", strand=1,
                             qualifiers=new_qualifiers)
i=0
in_handle = open(in_file)
for rec in GFF.parse(in_handle, limit_info=limit_info):
    for i in range(10):
        if rec.features[i].location.end == rec.features[i+1].location.start :
            # print rec.features[i]
            new_top_feature.sub_features[i] = [SeqFeature(FeatureLocation(rec.features[i].location.start ,
                rec.features[i+1].location.end ,strand=rec.features[i].strand),
                type="Transcription_unit", qualifiers=new_sub_qualifiers)]
in_handle.close()
rec.features = [new_top_feature]
with open(out_file, "w") as out_handle:
    GFF.write([rec], out_handle)
I get the following error:
/usr/lib/python2.7/dist-packages/Bio/SeqFeature.py:171: BiopythonDeprecationWarning: Rather using f.sub_features, f.location should be a CompoundFeatureLocation
BiopythonDeprecationWarning)
Traceback (most recent call last):
File "/home/nkumar/workplacekepler/random/src/limit.py", line 26, in <module>
new_top_feature.sub_features[i] = [SeqFeature(FeatureLocation(rec.features[i].location.start , rec.features[i+1].location.end ,strand=rec.features[i].strand), type="Transcription_unit", qualifiers=new_sub_qualifiers)]
IndexError: list assignment index out of range
Even though it is an index-out-of-range error, I cannot figure out what is wrong.
in_handle = open(in_file)
for rec in GFF.parse(in_handle, limit_info=limit_info):
    for i in range(10):
        if rec.features[i].location.end == rec.features[i+1].location.start :
            print 1
        else:
            print rec.features[i]
in_handle.close()
This one works perfectly and prints all the features.
You defined new_top_feature as:
type: genomeRegion
location: [0:2821361](+)
qualifiers:
Key: ID, Value: CP000253.1
Key: source, Value: prediction
But it has no sub-features:
>>> print new_top_feature.sub_features
[]
new_top_feature.sub_features is thus an empty list. You cannot assign to an index of an empty list, because that index does not exist yet:
>>> a = []
>>> a[0] = 3
Traceback (most recent call last):
File "<input>", line 1, in <module>
IndexError: list assignment index out of range
And this is what you are doing in
new_top_feature.sub_features[i] = .....
To add data to this list you should use append instead of indexing. If you need to fill the list at given positions in arbitrary order, you could create a list of adequate size filled with zeros and then assign the values to the positions as they come.
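As a rough illustration of both options, here is a minimal sketch on plain (start, end) tuples rather than SeqFeature objects. The helper name merge_adjacent is hypothetical, the coordinates are assumed to be 0-based and half-open as in Biopython's FeatureLocation (so GFF rows 313..516 and 517..1878 become (312, 516) and (516, 1878)), and chaining more than two adjacent rows into one span is an assumption about what you want.

```python
def merge_adjacent(spans):
    """Merge consecutive (start, end) spans where one ends exactly where the next starts.

    Assumes 0-based, half-open coordinates, as in Biopython's FeatureLocation.
    """
    merged = []  # starts empty, so grow it with append, not merged[i] = ...
    for start, end in spans:
        if merged and merged[-1][1] == start:
            # Adjacent to the previous span: extend that existing element.
            merged[-1] = (merged[-1][0], end)
        else:
            # Not adjacent (or first span): add a new element at the end.
            merged.append((start, end))
    return merged


# The three rows from the question, converted to 0-based half-open spans.
spans = [(0, 2821361), (312, 516), (516, 1878)]
merged = merge_adjacent(spans)
print(merged)

# Alternative for arbitrary fill order: pre-size the list, then assign by index.
slots = [None] * 3
slots[2] = "c"
slots[0] = "a"
print(slots)
```

Note that merged[-1] = ... is fine here because that element already exists; only assigning to an index that has never been filled raises the IndexError.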