GREP or AWK: Search in the first N characters of each line, and output surrounding lines that match pattern
I have RNA-seq data that looks like this:
@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--
@J00157:85:HNNJLBBXX:5:1101:12550:15574 1:N:0:ATTACTCG+TATAGCCT
GCTCTTCCGATCTGCTATTGATGACTGTCCTCTGTTCTTTCTTTCACAGTAGACGAGGACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTC
+
AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--
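If we treat everything starting at an @ as one record, only the second line is the actual sequence; lines 1, 3, 4 and 5 are bookkeeping/quality information.
The goal is to extract the sequences (the second line of each record) that contain "GCTGCA" within the first N (N = 35) characters of the line, and to output the surrounding lines as well (1 line before the matching line and 3 lines after it).
The expected output is: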
@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--
What I have tried:
awk 'substr($0, 1, 35) ~ "GCTGCA"' filename.fastq > newfile.fastq
grep -B 1 -A 2 -E GCTGCA filename.fastq > newfile.fastq
awk '{a[++i]=$0;}{substr(a[++i], 1, 35) ~ "GCTGCA"}{for(j=NR-1;j<=NR+2;j++)print a[j];}' filename.fastq > newfile.fastq
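The first cannot output the surrounding lines. The second cannot restrict the pattern match to the first 35 characters of each line. The third looks like it should work, but it gives me weird output (obviously incorrect):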
@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
+
+
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
--
--
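With gawk, which supports a multi-character RS:
awk -v RS='\n--\n' -F'\n' 'substr($2,1,35)~"GCTGCA"{printf "%s%s", $0, RS}' file
Setting the record separator to the whole -- line turns each block into one record, so $2 is the sequence line. Note that substr() positions start at 1.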
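For awk implementations without multi-character RS, the same record idea can be approximated by buffering lines. The following is a minimal sketch, assuming every record is exactly five lines (header, sequence, +, quality, --) as in the sample above. The file names sample.fastq and matched.fastq and the shortened reads are made up for illustration; read2 carries the motif only from column 41 on.

```shell
# Hypothetical, shortened sample in the layout of the question.
cat > sample.fastq <<'EOF'
@read1 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--
@read2 1:N:0:ATTACTCG+TATAGCCT
ACGTACGTACACGTACGTACACGTACGTACACGTACGTACGCTGCA
+
AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--
EOF

awk '
    { rec[NR % 5] = $0 }    # slots 1,2,3,4,0 hold header, sequence, "+", quality, "--"
    NR % 5 == 0 && index(substr(rec[2], 1, 35), "GCTGCA") {
        # the block is complete and its sequence has the motif in columns 1-35
        print rec[1]; print rec[2]; print rec[3]; print rec[4]; print rec[0]
    }
' sample.fastq > matched.fastq

cat matched.fastq
```

Only the read1 block should end up in matched.fastq. A final record that lacks its trailing -- line would be missed by this sketch.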
Using getline with awk:
search.awk:
substr($0, 1, 35) ~ "GCTGCA" {      # substr() positions start at 1
    print p                         # Print the previous line ...
    print                           # ... , the current line ...
    for (i = 1; i <= 3; i++) {      # ... and the 3 lines following it
        if ((getline) > 0)          # skip printing once the input is exhausted
            print
    }
}
# Store the previous line
{ p = $0 }
Call it like this:
awk -f search.awk input_file
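Plain getline replaces $0 and advances NR in the middle of a rule, which some people prefer to avoid. Below is a sketch of the same logic with a countdown counter instead of getline. The variable names prev and left, the file names and the shortened reads are hypothetical; read2 has the motif only from column 41 on, so it should be skipped.

```shell
# Hypothetical, shortened sample in the layout of the question.
cat > sample.fastq <<'EOF'
@read1 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--
@read2 1:N:0:ATTACTCG+TATAGCCT
ACGTACGTACACGTACGTACACGTACGTACACGTACGTACGCTGCA
+
AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--
EOF

awk '
    substr($0, 1, 35) ~ /GCTGCA/ {   # motif within columns 1-35 of this line
        print prev                   # the line before the match
        left = 4                     # the matching line plus the 3 lines after it
    }
    left > 0 { print; left-- }       # print while the countdown is running
    { prev = $0 }                    # remember this line for the next match
' sample.fastq > matched.fastq

cat matched.fastq
```

With this sample, matched.fastq should contain the five lines of the read1 block and nothing from read2.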
Or, without a regular expression and with the search string passed as a parameter:
search.awk:
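index(substr($0, 1, 35), search) {  # plain substring test; substr() positions start at 1
    print l                         # the previous line
    print                           # the matching line
    for (i = 1; i <= 3; i++) {      # the 3 lines following it
        if ((getline) > 0)          # skip printing once the input is exhausted
            print
    }
}
# Store the previous line
{ l = $0 }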
Call it like this:
awk -v search="GCTGCA" -f search.awk input_file
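When a read is unexpectedly kept or dropped, it can help to see where the motif actually starts. The following is a small diagnostic sketch, assuming five-line blocks as in the question. The parameter n, the output file positions.tsv, the inside/outside labels and the shortened reads are all made up for illustration. A hit counts as inside when its last character is at column n or earlier, which is the same condition as testing substr($0, 1, n).

```shell
# Hypothetical, shortened sample in the layout of the question.
cat > sample.fastq <<'EOF'
@read1 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--
@read2 1:N:0:ATTACTCG+TATAGCCT
ACGTACGTACACGTACGTACACGTACGTACACGTACGTACGCTGCA
+
AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--
EOF

awk -v n=35 -v search="GCTGCA" '
    NR % 5 == 1 { header = $0 }          # first line of each 5-line block
    NR % 5 == 2 {                        # sequence line
        pos = index($0, search)          # 1-based start of the first hit, 0 if none
        if (pos)
            printf "%s\t%d\t%s\n", header, pos,
                   (pos + length(search) - 1 <= n ? "inside" : "outside")
    }
' sample.fastq > positions.tsv

cat positions.tsv
```

For this sample the motif starts at column 20 in read1 (inside) and at column 41 in read2 (outside).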