Insert a line after pattern match
I have a file as follows:
Scaffold2 GeneWise mRNA 3038 6649
Scaffold2 GeneWise CDS 3038 3480
Scaffold2 GeneWise CDS 4175 4291
Scaffold3 GeneWise mRNA 2824 15173
Scaffold3 GeneWise CDS 2824 3302
Scaffold3 GeneWise CDS 4143 4344
I want to have this output:
Scaffold2 GeneWise mRNA 3038 6649
Scaffold2 GeneWise CDS 3038 **3480**
Scaffold2 GeneWise 1st_intron **3480 4175**
Scaffold2 GeneWise CDS **4175** 4291
Scaffold3 GeneWise mRNA 2824 15173
Scaffold3 GeneWise CDS 2824 **3302**
Scaffold3 GeneWise 1st_intron **3302 4143**
Scaffold3 GeneWise CDS **4143** 4344
The rule is: if column 3 is 'mRNA', take the 5th column of the next line and the 4th column of the line after that, and insert a new line between those two lines that contains these two values as its 4th and 5th columns (as the bold numbers indicate), with the third column set to '1st_intron'.
I have never dealt with such a problem. If you could give me some hints, that would be great.
You can use this simple awk:
awk '$3=="mRNA"{p=1; print; next}
p{s=$1 FS $2 FS "1st_intron" FS $5; print; p=0; next}
s{print s, $4; s=""} 1' file | column -t
Output:
Scaffold2 GeneWise mRNA 3038 6649
Scaffold2 GeneWise CDS 3038 3480
Scaffold2 GeneWise 1st_intron 3480 4175
Scaffold2 GeneWise CDS 4175 4291
Scaffold3 GeneWise mRNA 2824 15173
Scaffold3 GeneWise CDS 2824 3302
Scaffold3 GeneWise 1st_intron 3302 4143
Scaffold3 GeneWise CDS 4143 4344
column -t is only used to format the output.
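For readers who find the one-liner dense, the same flag-based logic can be written out with comments. This is only a sketch, assuming whitespace-separated input in which each mRNA line is followed by at least two CDS lines; the function name insert_first_intron and the variable names want_first_cds and prefix are illustrative and not part of the original answer.

```shell
# Hypothetical wrapper; reads the annotation on stdin, writes to stdout.
insert_first_intron() {
  awk '
    # mRNA line: print it and remember that the next line is the first CDS.
    $3 == "mRNA" { want_first_cds = 1; print; next }

    # First CDS after an mRNA line: keep its end coordinate (column 5).
    want_first_cds {
      prefix = $1 FS $2 FS "1st_intron" FS $5
      print
      want_first_cds = 0
      next
    }

    # Second CDS: its start (column 4) closes the intron, so emit the new line first.
    prefix != "" { print prefix, $4; prefix = "" }

    # Print the current line unchanged.
    { print }
  '
}
```

It could be called as insert_first_intron < file | column -t, where column -t again only aligns the output.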
$ cat tst.awk
p1 == "mRNA" { x=$5 }
p2 == "mRNA" { print $1, $2, "1st_intron", x, $4 }
{ print; p2=p1; p1=$3 }
$ awk -f tst.awk file | column -t
Scaffold2 GeneWise mRNA 3038 6649
Scaffold2 GeneWise CDS 3038 3480
Scaffold2 GeneWise 1st_intron 3480 4175
Scaffold2 GeneWise CDS 4175 4291
Scaffold3 GeneWise mRNA 2824 15173
Scaffold3 GeneWise CDS 2824 3302
Scaffold3 GeneWise 1st_intron 3302 4143
Scaffold3 GeneWise CDS 4143 4344
Perl solution. $intron is 0 when nothing needs to be done. It is set to 1 when an mRNA line is processed, so that on the next line $left can remember the end coordinate (column 5) and $intron is set to 2. On the line after that, state 2 prints the intron line and resets $intron to 0.
#!/usr/bin/perl
use warnings;
use strict;

my $intron = 0;
my ($left, $right);
while (<>) {
    my @items = split;
    if (1 == $intron) {
        $left = $items[4];
        $intron = 2;
    } elsif (2 == $intron) {
        print join "\t", @items[0, 1], '1st_intron', $left, $items[3];
        print "\n";
        $intron = 0;
    }
    $intron = 1 if 'mRNA' eq $items[2];
    print;
}
awk has a nice look-ahead function, getline:
awk '$3=="mRNA"{print; getline; c5=$5; print; getline; print $1,$2,"1st_intron",c5,$4; print}' file
Tested:
Scaffold2 GeneWise mRNA 3038 6649
Scaffold2 GeneWise CDS 3038 3480
Scaffold2 GeneWise 1st_intron 3480 4175
Scaffold2 GeneWise CDS 4175 4291
Scaffold3 GeneWise mRNA 2824 15173
Scaffold3 GeneWise CDS 2824 3302
Scaffold3 GeneWise 1st_intron 3302 4143
Scaffold3 GeneWise CDS 4143 4344
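Note that the getline one-liner above prints only the three-line groups that start at an mRNA line, and it does not check whether getline actually read a line. A more defensive variant might look like the sketch below. It assumes the same whitespace-separated layout, and the function name insert_first_intron_getline and the variable end_of_first_cds are illustrative, not taken from the original answer.

```shell
# Hypothetical wrapper; reads the annotation on stdin, writes to stdout.
insert_first_intron_getline() {
  awk '
    $3 == "mRNA" {
      print                              # the mRNA line itself
      if ((getline) <= 0) exit           # input ended right after the mRNA line
      end_of_first_cds = $5
      print                              # first CDS
      if ((getline) <= 0) exit           # no second CDS to pair with
      print $1, $2, "1st_intron", end_of_first_cds, $4
      print                              # second CDS
      next
    }
    { print }                            # all other lines pass through unchanged
  '
}
```

Here getline returns 1 when a record was read, 0 at end of input, and -1 on error, so the checks stop the program instead of reusing a stale line when the file is truncated.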