How to use AWK with string as a RS?
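I'm trying to use AWK, but I can't seem to get the first record right. I hope someone can help me solve this.
I have this file in which each record is 3 lines, but sometimes it is 4 lines (so there is a $3 and a $4). My goal is to print all three lines of each record and, if there is a fourth line, to also print the first two lines plus the fourth (without the third).
My strategy is to use the string ("Sequence: ") as RS and a newline ("\n") as FS.
My file looks like this: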
Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
With the following code I get a messed-up first record, because the string is also at the very start of the file.
awk '{ RS="Sequence: "; FS="\n" }
{
if ($4 != "" )
print $1,"\n",$2,"\n",$3,"\n",$1,"\n",$2,"\n",$4
else
print $1,"\n",$2,"\n",$3 ;
}' short.txt > test
Output:
Sequence:
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
Sequence:
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
1
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
So I thought I should remove the first "Sequence: " string from the input file, but that gives:
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
1
X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__
from:
to:
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
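So the first record is messed up again. Is there a way to fix this? My expected output is the last output (with or without the string "Sequence: "), but with the first record correct.
It sounds like this is what you're trying to do: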
$ cat tst.awk
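# A new "Sequence" line closes the previous record (nothing to flush on line 1)
/^Sequence/ { if (NR>1) prt() }
# Buffer every line of the current record
{ rec[++cnt] = $0 }
# Flush the last record at end of input
END { prt() }

function prt() {
    print rec[1] ORS rec[2] ORS rec[3]
    if (cnt == 4) {
        # 4-line record: repeat lines 1 and 2, then line 4
        print rec[1] ORS rec[2] ORS rec[4]
    }
    cnt=0
}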
$ awk -f tst.awk file
Sequence: X92272_IGHV4-31*08_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: X92273_IGHV4-31*09_Homosapiens_F_V-REGION_140..429_290nt_1_____290+0=290_partialin3'__ from: 1 to: 290
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: Z14235_IGHV4-31*10_Homosapiens_F_V-REGION_140..438_299nt_1_____299+0=299___ from: 1 to: 299
Start End Strand Pattern Mismatch Sequence
184 192 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
Sequence: AB019439_IGHV4-34*01_Homosapiens_F_V-REGION_59657..59949_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
150 158 + pattern:AA[CT]NNN[AT]CN . aatcaatca
Sequence: M99684_IGHV4-34*02_Homosapiens_F_V-REGION_311..603_293nt_1_____293+0=293___ from: 1 to: 293
Start End Strand Pattern Mismatch Sequence
178 186 + pattern:AA[CT]NNN[AT]CN . aacccgtcc
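Trying to use RS for this would just make your life harder, and the resulting code would not be portable (POSIX leaves a multi-character RS unspecified; gawk supports it).
Your code can easily be fixed like this: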
BEGIN{ RS="Sequence: "; FS="\n" }
(NR==1){next}
{
if ($4 != "" )
print $1,"\n",$2,"\n",$3,"\n",$1,"\n",$2,"\n",$4
else
print $1,"\n",$2,"\n",$3 ;
}
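The first record will be empty (the file begins with the separator, so there is nothing before it), which is why it is skipped with next.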
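The empty leading record can be seen with a small sketch. The input string `;a;b` and the single-character separator `;` are made up for illustration (a one-character RS is used so the example does not rely on multi-character RS support); the variable names are likewise hypothetical.

```shell
# Hypothetical input: the separator ";" is the very first character
input=';a;b'

# Without a guard, record 1 is the empty text in front of the first separator
all=$(printf '%s' "$input" | awk 'BEGIN { RS=";" } { print NR ": [" $0 "]" }')

# With the NR==1 guard, that empty leading record is skipped
skipped=$(printf '%s' "$input" | awk 'BEGIN { RS=";" } NR==1 { next } { print NR ": [" $0 "]" }')

printf '%s\n---\n%s\n' "$all" "$skipped"
```

The same thing happens in the question's file, where "Sequence: " plays the role of `;`.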
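The reason you had trouble with the first record is that you set RS and FS after the first record had already been read, i.e. in a main rule rather than in a BEGIN block, which runs before any input is processed. That first record was therefore split with the default RS (newline) and default FS (whitespace), which is why "Sequence:", the sequence name and "from:" each ended up on their own line.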
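A minimal sketch of that timing difference, again with a made-up input (`a b;c d`) and a single-character separator so that it should behave the same in any POSIX awk:

```shell
# Hypothetical input: two records separated by ";", no newline anywhere
input='a b;c d'

# RS assigned in a main rule: record 1 has already been read using the
# default RS (newline), so the whole input arrives as a single record
late=$(printf '%s' "$input" | awk '{ RS=";" } { print NR ": " $0 }')

# RS assigned in BEGIN: in effect before the first record is read
early=$(printf '%s' "$input" | awk 'BEGIN { RS=";" } { print NR ": " $0 }')

printf 'late:\n%s\nearly:\n%s\n' "$late" "$early"
```

In the `late` case the new RS only applies to records read after the first one, which matches what happened to the first "Sequence: " line in the question.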
To be safe, though, what you really want is RS="(^|\\n)Sequence: "
This just makes sure the separator only matches at the beginning of a line or of the file.