GREP or AWK: Search in the first N characters of each line, and output the lines surrounding each match
I have RNA-seq data that looks like this:
@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--
@J00157:85:HNNJLBBXX:5:1101:12550:15574 1:N:0:ATTACTCG+TATAGCCT
GCTCTTCCGATCTGCTATTGATGACTGTCCTCTGTTCTTTCTTTCACAGTAGACGAGGACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTC
+
AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
--
If we treat everything from one @ line up to the next as a record, only the second line holds the actual sequence; lines 1, 3, 4 and 5 carry bookkeeping and quality information.
The goal is to extract the sequences (the second line of each record) that contain "GCTGCA" within the first N (N=35) characters of the line, and at the same time output the surrounding lines (1 line before and 3 lines after the matched line).
An example of the expected output is:
@J00157:85:HNNJLBBXX:5:1101:2869:15047 1:N:0:ATTACTCG+TATAGCCT
CGACGCTCTTCCGATCTGAGCTGCAGCCTCGGCCCCAGGATCCCCCTGGGGGACTGGACGCTGCTATTGATTCACGAGGCGCTCAGATCGGAAGAGCACAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJFJJJJJJJJFJJJFJFJJJJJJJJJJJJJJJJ
--
What I have tried:
awk 'substr($0, 1, 35) ~ "GCTGCA"' filename.fastq > newfile.fastq
grep -B 1 -A 2 -E GCTGCA filename.fastq > newfile.fastq
awk '{a[++i]=$0;}{substr(a[++i], 1, 35) ~ "GCTGCA"}{for(j=NR-1;j<=NR+2;j++)print a[j];}' filename.fastq > newfile.fastq
The first one cannot output the surrounding lines. The second one cannot restrict pattern matching to the first 35 characters of each line. The third one should work, but it gives me weird output (which is obviously not correct):
@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
@J00157:85:HNNJLBBXX:5:1101:14235:1367 1:N:0:ATTACTCG+TATAGCCT
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
TCTNCTCTTCCGATCTACCCCACACACCCCCGCCGCCGCCGCCGCCGCCGCCCTCCGACGCACACCACACGCGCGCGCGCGCGCGCCGCCCCCGCCGCTCC
+
+
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
AAF#FJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFJJAJJJJJFJJJJ7JJ
--
--
With gawk, which supports a multi-character RS:
awk -v RS='\n--\n' -v ORS='\n--\n' -F'\n' 'substr($2,1,35)~"GCTGCA"' file
The record separator defines the records, so each read is one record and its sequence is field 2.
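If gawk is not available, the same record-based idea can probably be sketched in portable awk by counting lines instead of relying on a multi-character RS. This is only a sketch: the function name `first_n_awk` is made up for illustration, and it assumes every record is exactly 5 lines (header, sequence, "+", quality, "--"), as in the sample above.

```shell
# Hypothetical helper, not part of the original answer.
# Assumes each record is exactly 5 lines: header, sequence, "+", quality, "--".
# Usage: first_n_awk PATTERN N FILE
first_n_awk() {
    awk -v pat="$1" -v n="$2" '
        NR % 5 == 1 { header = $0; next }   # remember the header, print it later
        NR % 5 == 2 {                       # sequence line: test the first n characters
            keep = (index(substr($0, 1, n), pat) > 0)
            if (keep) print header
        }
        keep                                # print sequence, "+", quality and "--"
    ' "$3"
}
```

It would be called like `first_n_awk GCTGCA 35 filename.fastq > newfile.fastq`. Because it uses index() rather than a regular expression, the pattern is matched literally.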
With awk, using getline:
search.awk:
substr($0,1,35)~"GCTGCA" {
    print p                        # Print the previous line ...
    print                          # ... , current line ...
    for(i=0;i<=2;i++) {            # ... and the 3 lines following it
        if ((getline) > 0) print   # print nothing once the input ends
    }
}
# Store the previous line
{ p = $0 }
Call it like this:
awk -f search.awk input_file
Or without regular expressions and with a parameter:
search.awk:
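index(substr($0,1,35), search) {
    print l                        # previous line
    print                          # current (matching) line
    for(i=0;i<=2;i++) {            # the 3 lines following it
        if ((getline) > 0) print   # print nothing once the input ends
    }
}
# Store the previous line
{ l = $0 }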
Call it like this:
awk -v search="GCTGCA" -f search.awk input_file
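Since the question also asks about grep, here is a hedged sketch of a grep-only variant. A match lies entirely within the first N characters exactly when it starts at offset 0 to N minus the pattern length, so the regular expression can be anchored at the start of the line. The function name `first_n_grep` is hypothetical, the `[ACGTN]` prefix is an assumption meant to keep header and quality lines from matching, and `--no-group-separator` assumes GNU grep.

```shell
# Hypothetical helper, not part of the original answers.
# Assumes GNU grep (for --no-group-separator) and that the pattern is a plain
# string of bases no longer than N.
# Usage: first_n_grep PATTERN N FILE
first_n_grep() {
    pattern=$1
    n=$2
    file=$3
    # Latest start offset at which the pattern still fits inside the first n characters.
    max_offset=$(( n - ${#pattern} ))
    grep --no-group-separator -B 1 -A 3 -E "^[ACGTN]{0,${max_offset}}${pattern}" "$file"
}
```

It would be called like `first_n_grep GCTGCA 35 filename.fastq > newfile.fastq`. For N=35 and a 6-character pattern the expression becomes `^[ACGTN]{0,29}GCTGCA`.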