Extracting specific data from a file and writing it to another file
I tagged this python and perl only because that's what I've used so far. If anyone knows a better way to go about this, I'd certainly be willing to try it out. Anyway, my problem:
I need to create an input file for a gene prediction program in the following format:
seq1 5 15
seq1 20 34
seq2 50 48
seq2 45 36
seq3 17 20
Where seq# is the gene ID and the numbers to the right are the positions of exons within an open reading frame. I have this information in a .gff3 file that contains a lot of other information. I can open it with Excel and easily delete the columns with non-relevant data. Here's how it's arranged now:
PITG_00002 . gene 2 397 . + . ID=g.1;Name=ORF%
PITG_00002 . mRNA 2 397 . + . ID=m.1;
**PITG_00002** . exon **2 397** . + . ID=m.1.exon1;
PITG_00002 . CDS 2 397 . + . ID=cds.m.1;
PITG_00004 . gene 1 1275 . + . ID=g.3;Name=ORF%20g
PITG_00004 . mRNA 1 1275 . + . ID=m.3;
**PITG_00004** . exon **1 1275** . + . ID=m.3.exon1;P
PITG_00004 . CDS 1 1275 . + . ID=cds.m.3;P
PITG_00004 . gene 1397 1969 . + . ID=g.4;Name=
PITG_00004 . mRNA 1397 1969 . + . ID=m.4;
**PITG_00004** . exon **1397 1969** . + . ID=m.4.exon1;
PITG_00004 . CDS 1397 1969 . + . ID=cds.m.4;
So I need only the data that is in bold. For example,
PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
Any help you could give would be greatly appreciated, thanks!
Edit: Well, I messed up the formatting. Anything between the **'s is what I need, lol.
In Unix:
grep -w exon file.gff3 |
sed -E 's/^([^[:space:]]+)[[:space:]]+[.][[:space:]]+exon[[:space:]]+([0-9]+)[[:space:]]+([0-9]+).*$/\1 \2 \3/'
For pedestrians:
(this is Python)
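# Python 3: print is a function
data_file = 'file.gff3'  # path to your GFF3 file
with open(data_file) as f:
    for line in f:
        tokens = line.split()
        # need at least 5 columns, since tokens[4] is read below
        if len(tokens) > 4 and tokens[2] == 'exon':
            print(tokens[0], tokens[3], tokens[4])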
which prints
PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
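The question asks for the result in another file, while the snippet above only prints it. A minimal sketch of the same filter writing to a file follows; the function name `extract_exons` and the output file name in the usage comment are assumptions for illustration, not part of the original answer.

```python
def extract_exons(gff_path, out_path):
    """Copy seqid, start and end of every exon record to out_path."""
    with open(gff_path) as src, open(out_path, "w") as dst:
        for line in src:
            tokens = line.split()
            # Need at least 5 columns; column 3 holds the feature type.
            if len(tokens) >= 5 and tokens[2] == "exon":
                dst.write(" ".join((tokens[0], tokens[3], tokens[4])) + "\n")

# Usage (hypothetical file names):
#   extract_exons("file.gff3", "exons.txt")
```

Opening the output with "w" overwrites any previous result, so running it twice does not duplicate lines.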
It looks like your data is tab-separated.
This Perl program will print columns 1, 4 and 5 from all records that have exon in the third column.
You need to change the file name in the open statement to your actual file name.
use strict;
use warnings;

open my $fh, '<', 'genes.gff3' or die $!;

while (<$fh>) {
    chomp;
    my @fields = split /\t/;
    next unless @fields >= 5 and $fields[2] eq 'exon';
    print join("\t", @fields[0,3,4]), "\n";
}
Output:
PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
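For comparison, here is a rough Python sketch of the same tab-based split. Splitting on tabs rather than on any whitespace matters if a column ever contains a space, which as far as I know GFF3 permits. The function name `exon_rows` is made up for this example, and skipping lines that start with '#' (GFF3 comments and directives) is an addition the Perl program above does not make.

```python
def exon_rows(lines):
    """Yield (seqid, start, end) for each exon record in tab-separated GFF3 lines."""
    for line in lines:
        # Skip blank lines and '#' comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 is the feature type; columns 4 and 5 are start and end.
        if len(fields) >= 5 and fields[2] == "exon":
            yield fields[0], int(fields[3]), int(fields[4])
```

It accepts any iterable of lines, so an open file handle can be passed directly, for example `for seqid, start, end in exon_rows(open("genes.gff3")): ...`.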
Here's a Perl script option, run as perl scriptName.pl file.gff3:
use strict;
use warnings;

while (<>) {
    print "@{ [ (split)[ 0, 3, 4 ] ] }\n" if /exon/;
}
Output:
PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
Or you could just do the following:
perl -n -e 'print "@{ [ (split)[ 0, 3, 4 ] ] }\n" if /exon/' file.gff3
To save the data to a file:
use strict;
use warnings;

open my $inFH,  '<',  'file.gff3' or die $!;
open my $outFH, '>>', 'data.txt'  or die $!;

while (<$inFH>) {
    print $outFH "@{ [ (split)[ 0, 3, 4 ] ] }\n" if /exon/;
}
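One caveat about the /exon/ test used in these scripts: it matches "exon" anywhere on the line, not just in the third column. That works for the sample data, but a record of another type whose attributes mention "exon" would slip through. The Python sketch below illustrates the difference; both function names and the `tricky` record are invented for the demonstration and do not come from the question's file.

```python
import re

def is_exon_by_regex(line):
    # Mirrors the Perl /exon/ test: true if "exon" occurs anywhere in the line.
    return re.search(r"exon", line) is not None

def is_exon_by_column(line):
    # True only if the third whitespace-separated column is exactly "exon".
    fields = line.split()
    return len(fields) >= 5 and fields[2] == "exon"

# Hypothetical record: a CDS whose attributes happen to mention "exon".
tricky = "PITG_00002 . CDS 2 397 . + . ID=cds.m.1;Parent=m.1.exon1"
print(is_exon_by_regex(tricky))   # True
print(is_exon_by_column(tricky))  # False
```

In Perl, the equivalent stricter check is to compare the third field with eq 'exon', as the tab-splitting program earlier in the thread does.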