
Insert a line after pattern match

I have a file as follows:

Scaffold2   GeneWise        mRNA    3038    6649 
Scaffold2   GeneWise        CDS     3038    3480
Scaffold2   GeneWise        CDS     4175    4291
Scaffold3   GeneWise        mRNA    2824    15173
Scaffold3   GeneWise        CDS     2824    3302
Scaffold3   GeneWise        CDS     4143    4344

I want to have this output:

Scaffold2   GeneWise        mRNA    3038    6649 
Scaffold2   GeneWise        CDS     3038    **3480**
Scaffold2   GeneWise        1st_intron     **3480    4175**
Scaffold2   GeneWise        CDS     **4175**    4291
Scaffold3   GeneWise        mRNA    2824    15173
Scaffold3   GeneWise        CDS     2824    **3302**
Scaffold3   GeneWise        1st_intron     **3302    4143**
Scaffold3   GeneWise        CDS     **4143**    4344

It should work as follows: if column 3 is 'mRNA', take the 5th column of the next line and the 4th column of the line after that, then insert a new line between those two lines with these two values as its 4th and 5th columns (as the bold numbers indicate) and '1st_intron' as its third column.

I have never dealt with a problem like this. If you could give me a hint, that would be great.

You can use this simple awk:

awk '$3=="mRNA"{p=1; print; next}
     p{s=$1 FS $2 FS "1st_intron" FS $5; print; p=0; next}
     s{print s, $4; s=""} 1' file | column -t

Output:

Scaffold2  GeneWise  mRNA        3038  6649
Scaffold2  GeneWise  CDS         3038  3480
Scaffold2  GeneWise  1st_intron  3480  4175
Scaffold2  GeneWise  CDS         4175  4291
Scaffold3  GeneWise  mRNA        2824  15173
Scaffold3  GeneWise  CDS         2824  3302
Scaffold3  GeneWise  1st_intron  3302  4143
Scaffold3  GeneWise  CDS         4143  4344

column -t is only used to format the output.
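If tab-separated output is preferred over the aligned columns from column -t, the same idea can set OFS instead. This is only a minimal sketch, assuming whitespace-separated input on stdin; the function name add_first_intron and the variables pending and start are my own names, not part of the answer above.

```shell
# Hypothetical wrapper: reads the annotation on stdin, writes tab-separated lines.
add_first_intron() {
    awk -v OFS='\t' '
        { $1 = $1 }                  # rebuild the record so every line uses OFS
        pending == 2 {               # second line after mRNA: emit the intron first
            print $1, $2, "1st_intron", start, $4
            pending = 0
        }
        pending == 1 {               # first line after mRNA: remember its end (column 5)
            start = $5
            pending = 2
        }
        $3 == "mRNA" { pending = 1 } # arm the state machine for the next two lines
        { print }                    # every input line is passed through
    '
}
```

It would be called as add_first_intron < file, with no column -t needed afterwards.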

$ cat tst.awk
p1 == "mRNA" { x=$5 }
p2 == "mRNA" { print $1, $2, "1st_intron", x, $4 }
{ print; p2=p1; p1=$3 }

$ awk -f tst.awk file | column -t
Scaffold2  GeneWise  mRNA        3038  6649
Scaffold2  GeneWise  CDS         3038  3480
Scaffold2  GeneWise  1st_intron  3480  4175
Scaffold2  GeneWise  CDS         4175  4291
Scaffold3  GeneWise  mRNA        2824  15173
Scaffold3  GeneWise  CDS         2824  3302
Scaffold3  GeneWise  1st_intron  3302  4143
Scaffold3  GeneWise  CDS         4143  4344

Perl solution.

$intron is 0 when nothing needs to be done. It is set to 1 when an mRNA line is processed, so that on the next line $left can remember the intron's first number (that line's 5th column) and $intron is set to 2. On the line after that, the intron line is printed and $intron is reset to 0.

#!/usr/bin/perl
use warnings;
use strict;

my $intron = 0;
my ($left, $right);
while (<>) {
    my @items = split;

    if (1 == $intron) {
        $left = $items[4];
        $intron = 2;

    } elsif (2 == $intron) {
        print join "\t", @items[0, 1], '1st_intron', $left, $items[3];
        print "\n";
        $intron = 0;
    }

    $intron = 1 if 'mRNA' eq $items[2];
    print;
}

awk's getline reads the next input record, which gives a simple form of look-ahead:

awk '$3=="mRNA"{print;getline;c5=$5;print;getline;print $1," ",$2,"       1st_intron",c5,$4} 1' file

Tested:

Scaffold2   GeneWise        mRNA    3038    6649
Scaffold2   GeneWise        CDS     3038    3480
Scaffold2   GeneWise        1st_intron 3480 4175
Scaffold2   GeneWise        CDS     4175    4291
Scaffold3   GeneWise        mRNA    2824    15173
Scaffold3   GeneWise        CDS     2824    3302
Scaffold3   GeneWise        1st_intron 3302 4143
Scaffold3   GeneWise        CDS     4143    4344
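getline returns 1 when it reads a record, 0 at end of input, and -1 on error, so a more defensive variant can check that value before using the next line. The following is a sketch under that assumption, not a tested replacement for the one-liner; the function name add_first_intron_getline and the variable start are hypothetical.

```shell
# Hypothetical wrapper: reads the annotation on stdin.
add_first_intron_getline() {
    awk '
        $3 == "mRNA" {
            print                                  # the mRNA line itself
            if ((getline) <= 0) exit               # input ended: nothing more to do
            start = $5; print                      # first CDS: remember its end
            if ((getline) <= 0) exit               # no second CDS: stop without an intron
            print $1, $2, "1st_intron", start, $4  # intron goes before the second CDS
        }
        { print }                                  # print the current record, mRNA block or not
    '
}
```

Called as add_first_intron_getline < file, it also passes through lines that are not part of an mRNA block, and it stops cleanly if the file ends right after an mRNA line.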
