How to use AWK with a string as RS?

Question

I want to use AWK, but I can't seem to get the first record right. I hope someone can help me fix it.

I have this file. Every record is 3 lines, but sometimes it has 4 lines (so there is a $3 and a $4). My goal is to print all three lines of each record, and if there is a fourth line, I also want to print the first 2 lines together with the fourth (without the 3rd).

My strategy is to use a string ("Sequence: ") as RS, and a newline ("\n") as FS.

My file looks like this:

Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

With the following code I get a messed-up first record, because the string is at the beginning of the file as well.

awk '{ RS="Sequence: "; FS="\n" }
{
if ($4 != "" )
    print $1,"\n",$2,"\n",$3,"\n",$1,"\n",$2,"\n",$4
else
    print $1,"\n",$2,"\n",$3 ;
}' short.txt > test

With output:

Sequence:
 X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
 from:
 Sequence:
 X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
 1
Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
 Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
 Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
 AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
 M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

So I thought I should remove the first "Sequence: " string from the input file, but that gives:

X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
 from:
 1
 X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
 from:
 to:
Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
 Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
 Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
 AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
 M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

So again the first record is messed up. Is there a solution to this problem? My expected output is the same as the last output (with or without the string "Sequence: "), but with the first record correct.

Answer 1

It sounds like this is what you're trying to do:

$ cat tst.awk
/^Sequence/ { if (NR>1) prt() }
{ rec[++cnt] = $0 }
END { prt() }
function prt() {
    print rec[1] ORS rec[2] ORS rec[3]
    if (cnt == 4) {
        print rec[1] ORS rec[2] ORS rec[4]
    }
    cnt=0
}

$ awk -f tst.awk file
Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

Trying to use RS for this just makes your life harder and the resulting code non-portable (gawk-only).

Answer 2

Your code can easily be fixed as:

BEGIN{ RS="Sequence: "; FS="\n" }
(NR==1){next}
{
if ($4 != "" )
    print $1,"\n",$2,"\n",$3,"\n",$1,"\n",$2,"\n",$4
else
    print $1,"\n",$2,"\n",$3 ;
}

The first record will be empty, which is why it is skipped with next.

The reason you had problems with your first record is that you defined RS and FS after the first record had already been read (i.e. not in a BEGIN block, which runs before any input is read). That first record was therefore split with the default RS (newline) and the default FS (whitespace).
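To make the timing concrete, here is a minimal sketch that is not part of the original answer. It uses a hypothetical three-item input and a single-character RS so that it should behave the same in any POSIX awk. Only the placement of the RS assignment differs between the two commands.

```shell
# RS assigned in a main rule: record 1 has already been read with the
# default RS (newline), so the whole input arrives as one record.
late=$(printf 'a;b;c' | awk '{ RS = ";" } { print NR ": " $0 }')

# RS assigned in BEGIN: it is in effect before any input is read,
# so the input is split into three records.
early=$(printf 'a;b;c' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')

printf 'late:\n%s\n\nearly:\n%s\n' "$late" "$early"
```

The same thing happens in the question's script: the first line of the file is read as record 1 and split on whitespace before the assignments to RS and FS take effect.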

But what you really want, just to be safe, is RS="(^|\n)Sequence: ". This ensures that the separator only matches at the beginning of a line or of the file.
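As a rough sketch of how the anchored separator could be used end to end, the snippet below runs it on a made-up two-record sample (the names A, B, hdr, row1 and row2 are placeholders, not the real data). It assumes an awk that treats a multi-character RS as a regular expression, which is a gawk extension, so it prefers gawk when it is installed. It also joins fields with ORS instead of print's commas, which avoids the leading spaces seen in the question's output.

```shell
# Prefer gawk: regex RS is an extension, not POSIX behaviour.
AWK=$(command -v gawk || command -v awk)

# Hypothetical sample: one 3-line record and one 4-line record.
sample='Sequence: A from: 1
hdr
row1
Sequence: B from: 2
hdr
row1
row2
'

out=$(printf '%s' "$sample" | "$AWK" '
BEGIN { RS = "(^|\n)Sequence: "; FS = "\n" }
NR == 1 { next }    # empty record in front of the first separator
{
    print $1 ORS $2 ORS $3
    if ($4 != "") print $1 ORS $2 ORS $4    # 4-line record: repeat lines 1-2 with line 4
}')

printf '%s\n' "$out"
```

Because the separator text is consumed by RS, the "Sequence: " prefix does not appear in the output; it can be added back by printing "Sequence: " $1 if it is wanted.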
