Extracting specific data from a file and writing it to another file

I tagged python and perl in this only because that's what I've used thus far. If anyone knows a better way to go about this, I'd certainly be willing to try it out. Anyway, my problem:

I need to create an input file for a gene prediction program that uses the following format:

seq1 5 15
seq1 20 34

seq2 50 48
seq2 45 36

seq3 17 20

Where seq# is the geneID and the numbers to the right are the positions of exons within an open reading frame. I have this information in a .gff3 file that also contains a lot of other information. I can open it in Excel and easily delete the columns with non-relevant data. Here's how it's arranged now:

PITG_00002  .   gene    2   397 .   +   .   ID=g.1;Name=ORF%
PITG_00002  .   mRNA    2   397 .   +   .   ID=m.1;
**PITG_00002**  .   exon    **2 397**   .   +   .   ID=m.1.exon1;
PITG_00002  .   CDS 2   397 .   +   .   ID=cds.m.1;

PITG_00004  .   gene    1   1275    .   +   .   ID=g.3;Name=ORF%20g
PITG_00004  .   mRNA    1   1275    .   +   .   ID=m.3;
**PITG_00004**  .   exon    **1 1275**  .   +   .   ID=m.3.exon1;P
PITG_00004  .   CDS 1   1275    .   +   .   ID=cds.m.3;P

PITG_00004  .   gene    1397    1969    .   +   .   ID=g.4;Name=
PITG_00004  .   mRNA    1397    1969    .   +   .   ID=m.4;
**PITG_00004**  .   exon    **1397  1969**  .   +   .   ID=m.4.exon1;
PITG_00004  .   CDS 1397    1969    .   +   .   ID=cds.m.4;

So I need only the data that is in bold. For example,

PITG_00002 2 397

PITG_00004 1 1275
PITG_00004 1397 1969

Any help you could give would be greatly appreciated, thanks!

Edit: Well, I messed up the formatting. Anything that is between the **'s is what I need lol.

In Unix:

grep -w exon file.gff3 |
    sed -E 's/^([^[:space:]]+)[[:space:]]+[.][[:space:]]+exon[[:space:]]+([0-9]+)[[:space:]]+([0-9]+).*$/\1 \2 \3/'

For pedestrians:

(this is Python)

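data_file = 'file.gff3'  # change this to the path of your GFF3 file

with open(data_file) as f:
    for line in f:
        tokens = line.split()
        # keep only rows that have 'exon' in the third column
        if len(tokens) >= 5 and tokens[2] == 'exon':
            print(tokens[0], tokens[3], tokens[4])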

which prints

PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
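
The target format in the question also puts a blank line between sequences and goes into a file rather than onto the terminal. Here is a minimal sketch of that, assuming whitespace-separated columns as above. The function name `write_exon_table` and any file names passed to it are placeholders, not something from the original post.

```python
def write_exon_table(gff_path, out_path):
    """Write 'seqid start end' per exon row, with a blank line between seqids."""
    previous_id = None
    with open(gff_path) as src, open(out_path, "w") as dst:
        for line in src:
            tokens = line.split()
            # Skip anything that is not an exon row with at least 5 columns.
            if len(tokens) < 5 or tokens[2] != "exon":
                continue
            seq_id, start, end = tokens[0], tokens[3], tokens[4]
            # Emit a blank line whenever the sequence ID changes
            # (but not before the very first record).
            if previous_id is not None and seq_id != previous_id:
                dst.write("\n")
            dst.write(f"{seq_id} {start} {end}\n")
            previous_id = seq_id
```

Calling something like `write_exon_table("file.gff3", "exons.txt")` would group consecutive rows by the first column, so it relies on the GFF3 file already being ordered by sequence ID, as in the sample.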

It looks like your data is tab-separated.

This Perl program will print columns 1, 4 and 5 from all records that have exon in the third column. You need to change the file name in the open statement to your actual file name.

use strict;
use warnings;

open my $fh, '<', 'genes.gff3' or die $!;

while (<$fh>) {
  chomp;
  my @fields = split /\t/;
  next unless @fields >= 5 and $fields[2] eq 'exon';
  print join("\t", @fields[0,3,4]), "\n";
}

Output:

PITG_00002  2 397
PITG_00004  1 1275
PITG_00004  1397  1969
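
For comparison, the same tab-based filter can be sketched in Python with the standard `csv` module. This is only an illustration under the assumption that the file really is tab-delimited. It also skips lines starting with `#`, which GFF3 uses for directives and comments. The function name `exon_rows` is a placeholder.

```python
import csv


def exon_rows(gff_path):
    """Yield (seqid, start, end) for every row whose third column is 'exon'."""
    with open(gff_path, newline="") as handle:
        # QUOTE_NONE: treat quote characters in the attributes column literally.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            # Skip blank lines, '#' comment/directive lines, and short rows.
            if len(fields) < 5 or fields[0].startswith("#"):
                continue
            if fields[2] == "exon":
                yield fields[0], fields[3], fields[4]
```

Looping over `exon_rows("genes.gff3")` and printing each tuple joined with tabs would give the same three columns as the Perl program. Splitting strictly on tabs means a file whose columns are separated by spaces would yield nothing.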

Here's a Perl script option. Run it as perl scriptName.pl file.gff3:

use strict;
use warnings;

while (<>) {
    print "@{ [ (split)[ 0, 3, 4 ] ] }\n" if /exon/;
}

Output:

PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969

Or you could just do the following:

perl -n -e 'print "@{ [ (split)[ 0, 3, 4 ] ] }\n" if /exon/' file.gff3

To save the data to a file:

use strict;
use warnings;

open my $inFH,  '<',  'file.gff3' or die $!;
open my $outFH, '>>', 'data.txt'  or die $!;

while (<$inFH>) {
    print $outFH "@{ [ (split)[ 0, 3, 4 ] ] }\n" if /exon/;
}
