How can I use AWK with a string as RS?

Question

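I am trying to use AWK, but the first record does not come out right. I hope someone can help me fix this.

I have a file in which each record is 3 lines long, but sometimes a record has 4 lines (so there is a $3 and a $4). My goal is to print all three lines of every record and, if there is a fourth line, to also print the first two lines followed by the fourth line (without the third).

My strategy is to use a string ("Sequence: ") as RS and a newline ("\n") as FS.

My file looks like this: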

Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

With the following code I get a garbled first record, because the string also occurs at the very beginning of the file.

awk '{ RS="Sequence: "; FS="\n" }
{
if ($4 != "" )
    print $1,"\n",$2,"\n",$3,"\n",$1,"\n",$2,"\n",$4
else
    print $1,"\n",$2,"\n",$3 ;
}' short.txt > test

Output:

Sequence:
 X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
 from:
 Sequence:
 X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
 1
Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
 Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
 Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
 AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
 M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

So I thought I should remove the first "Sequence: " string from the input file, but that gives:

X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
 from:
 1
 X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
 from:
 to:
Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
 Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
 Start     End  Strand Pattern                 Mismatch Sequence
 184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
 AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
 M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
 Start     End  Strand Pattern                 Mismatch Sequence
 178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

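So the first record is garbled again. Is there a way to solve this? My expected output is the last output above (with or without the "Sequence: " string), but with the first record correct.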

Answer 1

It sounds like this is what you are trying to do:

$ cat tst.awk
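/^Sequence/ { if (NR>1) prt() }     # a new record starts: flush the previous one
{ rec[++cnt] = $0 }                 # buffer every line of the current record
END { prt() }                       # flush the final record
function prt() {
    print rec[1] ORS rec[2] ORS rec[3]          # always print lines 1-3
    if (cnt == 4) {
        print rec[1] ORS rec[2] ORS rec[4]      # 4-line record: repeat lines 1-2 with line 4
    }
    cnt=0                                       # reset the buffer for the next record
}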

$ awk -f tst.awk file
Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__     from: 1   to: 290
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___     from: 1   to: 299
Start     End  Strand Pattern                 Mismatch Sequence
184     192       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
150     158       + pattern:AA[CT]NNN[AT]CN        . aatcaatca
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___     from: 1   to: 293
Start     End  Strand Pattern                 Mismatch Sequence
178     186       + pattern:AA[CT]NNN[AT]CN        . aacccgtcc

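Trying to use RS for this just makes your life harder, and the resulting code is not portable: a multi-character RS is an extension (supported by gawk, for example) whose behavior POSIX awk leaves unspecified.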

Answer 2

Your code can easily be fixed like this:

BEGIN{ RS="Sequence: "; FS="\n" }
(NR==1){next}
{
if ($4 != "" )
    print $1,"\n",$2,"\n",$3,"\n",$1,"\n",$2,"\n",$4
else
    print $1,"\n",$2,"\n",$3 ;
}

The first record will be empty, which is why it is skipped with next.
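To see the empty first record in action, here is a minimal sketch on a made-up two-record sample (the names `sample` and `out` and the data are hypothetical, not taken from the question's file). It assumes an awk that accepts a multi-character RS, such as gawk or mawk. It also departs from the code above in three small ways: OFS is set to "\n" so the output lines carry no leading space, the "Sequence: " prefix is put back, and the skip is guarded with `$0 == ""` so a file that does not start with the separator would not lose its first record.

```shell
# Hypothetical sample: the second record has an extra (fourth) line.
sample='Sequence: A
Start End
1 2
Sequence: B
Start End
3 4
5 6
'

out=$(printf '%s' "$sample" | awk '
BEGIN { RS = "Sequence: "; FS = "\n"; OFS = "\n" }
NR == 1 && $0 == "" { next }          # empty record in front of the first separator
{
    print "Sequence: " $1, $2, $3     # header, column line, first match line
    if ($4 != "")
        print "Sequence: " $1, $2, $4 # same header again with the second match line
}')

printf '%s\n' "$out"
```

Because every record still ends with a newline, a 3-line record has an empty $4 and a 4-line record has a non-empty one, which is what the `$4 != ""` test relies on.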

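The reason you had a problem with the first record is that you defined RS and FS after the first record had already been read (that is, in the main block rather than in a BEGIN block, which runs before any input is processed).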
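The timing issue can be reproduced with a single-character RS, which every POSIX awk supports. The following is only an illustrative sketch with a hypothetical two-line input (the names `input`, `late` and `early` are made up); it compares assigning RS in the main rule with assigning it in BEGIN.

```shell
# Hypothetical two-line input with ";" inside each line.
input='a;b;c
d;e
'

# RS assigned in the main rule: record 1 has already been read
# with the default RS (newline) by the time the assignment runs.
late=$(printf '%s' "$input" | awk '{ RS = ";" } { print NR ": " $0 }')

# RS assigned in BEGIN: it applies from the very first record.
early=$(printf '%s' "$input" | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')

printf 'late:\n%s\n\nearly:\n%s\n' "$late" "$early"
```

With the late assignment, record 1 is the entire first line `a;b;c`; with BEGIN it is just `a`. The same thing happened in the question: the first line was read as a newline-terminated record and split on whitespace, so $1 was "Sequence:", $3 was "from:" and $4 was "1".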

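But what you most likely really want is RS="(^|\\n)Sequence: ", which ensures that the separator only matches at the start of a line or at the start of the file.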


Answer 1: score 2, accepted, 2018-09-06 13:57:21
Answer 2: score 1, 2018-09-06 15:49:34
