
Using Bash to Manually Edit a Text or Fastq File

I would like to manually edit a Fastq file using Bash, making the same change to multiple similar lines.

In Fastq files, a sequence read starts on line 2 and then appears on every fourth line (i.e. lines 2, 6, 10, 14, ...).

I would like to create an edited text file that is identical to the Fastq file, except that the first 6 characters of each sequencing read are trimmed off.

Unedited Fastq:

@M03017:21:000000000
GAGAGATCTCTCTCTCTCTCT
+
111>>B1FDFFF

Edited Fastq:

@M03017:21:000000000
TCTCTCTCTCTCTCT
+
111>>B1FDFFF

I guess awk is perfect for this:

$ awk 'NR%4==2 {gsub(/^.{6}/,"")} 1' file
@M03017:21:000000000
TCTCTCTCTCTCTCT
+
111>>B1FDFFF

This removes the first 6 characters from every line at position 4k+2 (lines 2, 6, 10, ...).

Explanation

  • NR%4==2 {} runs the action when the record number (line number) has the form 4k+2.
  • gsub(/^.{6}/,"") replaces the first 6 characters with an empty string.
  • 1 evaluates to true, so awk takes its default action and prints the line.
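
The same idea can be sketched with substr and a variable for the trim length, which avoids hard-coding the 6 in a regex. This is only an illustration: the file names sample.fastq and trimmed.fastq, the read names, and the variable n are assumptions made for the example, not part of the original answer.

```shell
# Build a small two-record sample (hypothetical file name and read names).
printf '%s\n' \
  '@read1' 'GAGAGATCTCTCTCTCTCTCT' '+' '111>>B1FDFFF' \
  '@read2' 'ACGTACGTACGT' '+' 'FFFFFFFFFFFF' > sample.fastq

# n is the number of leading characters to drop from each sequence line.
# substr($0, n + 1) keeps everything from character n+1 onward.
# The trailing 1 prints every line, edited or not.
awk -v n=6 'NR % 4 == 2 { $0 = substr($0, n + 1) } 1' sample.fastq > trimmed.fastq

cat trimmed.fastq
```

Writing to a second file leaves the input untouched, so the result can be inspected before replacing anything.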

GNU sed can do that:

sed -i~ '2~4s/^.\{6\}//' file

The address 2~4 means "start on line 2, repeat every 4 lines".

s means substitute, ^ matches the beginning of the line, . matches any character, and \{6\} specifies how many times to match (a "quantifier"). The replacement string is empty ( // ).

-i~ edits the file in place, leaving a backup of the original with ~ appended to the filename.
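
A quick way to check what the command did is to compare the edited file with its backup. The sketch below assumes GNU sed (the 2~4 address and the -i~ suffix form are GNU extensions) and uses a hypothetical file name, reads.fastq.

```shell
# One-record sample; reads.fastq is a hypothetical name.
printf '%s\n' '@read1' 'GAGAGATCTCTCTCTCTCTCT' '+' '111>>B1FDFFF' > reads.fastq

# GNU sed only: 2~4 selects lines 2, 6, 10, ...
# -i~ edits reads.fastq in place and keeps the original as reads.fastq~
sed -i~ '2~4s/^.\{6\}//' reads.fastq

# Show the difference between the backup and the edited file.
# diff exits non-zero when the files differ, so "|| true" keeps the script going.
diff reads.fastq~ reads.fastq || true
```

Only the sequence line should appear in the diff; the header, separator and quality lines are left as they were.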
