Match the number of characters in two lines

I have a file that I'm trying to prepare for some downstream analysis, but I need the number of characters in two lines to be identical. The file is formatted as below, where the 2nd (CTTATAATGCCGCTCCCTAAG) and 4th (bbbeeeeegggggiiiiiiiiigghiiiiiiiiiiiiiiiiiigeccccb) lines of each group need to contain the same number of characters.

@HWI-ST:8:1101:3346:2198#GTCCGC/1
CTTATAATGCCGCTCCCTAAG
+HWI-ST:8:1101:3346:2198#GTCCGC/1
bbbeeeeegggggiiiiiiiiigghiiiiiiiiiiiiiiiiiigeccccb
@HWI-ST:8:1101:10491:2240#GTCCGC/1
GAGTAGGGAGTATACATCAG
+HWI-ST:8:1101:10491:2240#GTCCGC/1
abbceeeeggggfiiiiiigg`gfhfhhifhifdgg^ggdf_`_Y[aa_R
@HWI-ST:8:1101:19449:2134#GTCCGC/1
AAGAAGAGATCTGTGGACCA

So far I've pulled out the second line from each set of four and generated a file containing a record of the length of each line using:

grep -v '[^A-Z]' file.fastq |awk '{ print length($0); }' > newfile

Now I'm looking for a way to use this record to tell a sed command how many characters to trim off the end of each line. Something similar to:

sed -r 's/.{n}$//' file

Here n would be replaced with some expression that references the text file. I wonder if I'm overcomplicating things, but I need the lines to match EXACTLY, so I haven't been able to think of another way to go about it. Any help would be awesome, thanks!

This might be what you're looking for:

awk '
  # If 2nd line of 4-line group, save length as len.
  NR % 4 == 2 { len = length($0) }

  # If 4th line of 4-line group, trim the line to len.
  NR % 4 == 0 { $0 = substr($0, 1, len)}

  # print every line
  { print }
' file

This assumes that the file consists of 4-line groups where the 2nd and 4th line of each group are the ones you're interested in. It also assumes that the 2nd line of each group will be no longer than its corresponding 4th line; if the 2nd line is longer, substr leaves the 4th line unchanged and the two lengths will still differ.
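If the second assumption might not hold for your data, one option is to trim both lines to the shorter of the two. The following is only a sketch of that idea: the function names trim_fastq and check_fastq are made up for illustration, and shortening the 2nd line discards characters from it, which may or may not be acceptable for your downstream analysis.

```shell
# Hypothetical helper: trim the 2nd and 4th line of each 4-line group
# to the shorter of the two lengths. Not part of any existing tool.
trim_fastq() {
  awk '
    # 2nd line of the group: hold it until the 4th line has been seen.
    NR % 4 == 2 { seq = $0; next }

    # 3rd line of the group: hold it as well so the output order is kept.
    NR % 4 == 3 { plus = $0; next }

    # 4th line of the group: trim both lines to the shorter length.
    NR % 4 == 0 {
      n = (length(seq) < length($0)) ? length(seq) : length($0)
      print substr(seq, 1, n)
      print plus
      print substr($0, 1, n)
      next
    }

    # 1st line of the group passes through unchanged.
    { print }

    # Flush held lines if the file ends in the middle of a group.
    END {
      if (NR % 4 == 2) print seq
      if (NR % 4 == 3) { print seq; print plus }
    }
  ' "$1"
}

# Hypothetical helper: report groups whose 2nd and 4th lines differ in length.
# Returns a non-zero status if any mismatch is found.
check_fastq() {
  awk '
    NR % 4 == 2 { len = length($0) }
    NR % 4 == 0 && length($0) != len {
      bad++
      print "group ending at line " NR ": " len " vs " length($0)
    }
    END { exit (bad > 0) }
  ' "$1"
}
```

For example, `trim_fastq file.fastq > trimmed.fastq && check_fastq trimmed.fastq` writes the trimmed file and then confirms that every group matches. The check on its own also works on the output of the original answer.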
