
How to grep specific letters in a sequence using grep

I have a file containing information in this form:

>scaffold1|size69534
ACATAAGAGSTGATGATAGATAGATGCAGATGACAGATGANNGTGANNNNNNNNNNNNNTAGAT
>scaffold2|size68281
ATAGAGATGAGACAGATGACAGANNNNAGATAGATAGAGCAGATAGACANNNNAGATAGAG
>scaffold3|size67203
ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGACNNNNNNACATYYYYYYYYYYYYYYYYY
>scaffold4|size66423
ACAGATAGCAGATAGACAGATNNNNNNNAGATAGTAGACSSSSSSSSSS

and so on.

But I suspect there is something abnormal in the sequences, so I want to grep all the letters that are not A, C, T, G or N in the lines after each scaffold line (I want to search only the sequence lines, not the >scaffold-size lines).
In the example above, it would find YYYYYYYYYYYYYYYYY in scaffold3 and SSSSSSSSSS in scaffold4.
I hope that is clear enough; please tell me if you need any clarification.

Thank you.

Could you please try the following? It assumes that you don't want empty lines in the output.

awk '!/^>/{gsub(/[ACTGN]/,"");if(NF){print}}'  Input_file

Explanation: a commented version of the code above.

awk '                    ##Start of the awk program.
!/^>/{                   ##If the line does not start with >, do the following.
  gsub(/[ACTGN]/,"")     ##Globally replace A, C, T, G and N with an empty string.
  if(NF){                ##If anything is left on the line after the substitution, do the following.
    print                ##Print the current line.
  }
}
'  Input_file            ##Name of the input file.

The output will be as follows.

S
YYYYYYYYYYYYYYYYY
SSSSSSSSSS
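
Because the one-liner above deletes the valid letters, separate runs of invalid letters on the same line are joined together, and their positions are lost. If positions matter, a rough sketch along the following lines might help. It uses awk's match() with RSTART and RLENGTH to walk through each sequence line; the function name locate_invalid and the tab-separated output layout (label, 1-based position, run) are my own choices for illustration, not part of the answer above.

```shell
# Hypothetical helper: report label, 1-based start position and text of every
# run of characters other than A, C, G, T, N on each sequence line.
locate_invalid() {
  awk '
    /^>/ { label = $0; next }            # remember the most recent label line
    {
      rest = $0; offset = 0              # unscanned part of the line and its offset
      while (match(rest, /[^ACGTN]+/)) { # find the next run of invalid characters
        print label "\t" (offset + RSTART) "\t" substr(rest, RSTART, RLENGTH)
        offset += RSTART + RLENGTH - 1   # advance past the run just reported
        rest = substr(rest, RSTART + RLENGTH)
      }
    }
  ' "$1"
}

# Example usage (Input_file is the placeholder name used above):
# locate_invalid Input_file
```

Each reported position is counted within its own line, so for sequences wrapped over several lines the line offsets would still have to be added up.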

Let's assume you don't just need to know which sequences contain invalid characters; you also want to know which scaffold each sequence belongs to. This can be done; how to do it depends on the exact output format you need, and also on the exact structure of the data.

Just for illustration, I will make the following simplifying assumptions: the "sequences" may only contain uppercase letters (valid or invalid ones, but no punctuation marks, digits, etc.), and the labels (the rows that begin with >) don't contain any uppercase letters. Note: if the sequences only contain letters, then it's not too hard to pre-process the file to convert the sequences to all-uppercase and the labels to all-lowercase, so the solution below will still work.
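
The pre-processing step mentioned above could look roughly like this in awk. It is only a sketch under the same assumptions (labels start with >, everything else is a sequence line); the function name normalize_case and the file names are placeholders.

```shell
# Hypothetical helper: lowercase the label lines, uppercase the sequence lines.
normalize_case() {
  awk '/^>/ { print tolower($0); next }  # label line: convert to lowercase
            { print toupper($0) }' "$1"  # sequence line: convert to uppercase
}

# Example usage (file names are placeholders):
# normalize_case input_file > normalized_file
```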

In some versions of grep, the invalid characters will appear in a different color (see the linked image). I find this quite helpful.

grep --no-group-separator -B 1 '[BDEFHIJKLMOPQRSUVWXYZ]' input_file

Output:

>scaffold1|size69534
ACATAAGAGSTGATGATAGATAGATGCAGATGACAGATGANNGTGANNNNNNNNNNNNNTAGAT
>scaffold3|size67203
ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGACNNNNNNACATYYYYYYYYYYYYYYYYY
>scaffold4|size66423
ACAGATAGCAGATAGACAGATNNNNNNNAGATAGTAGACSSSSSSSSSS

[Linked image not reproduced here.]
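
Note that --no-group-separator is a GNU grep option and may not be available in other grep implementations. As a rough alternative, the awk sketch below remembers the most recent label line and prints it together with any sequence line that contains a character other than A, C, G, T or N. It assumes, as in the sample data, that each label is followed by a single sequence line; report_invalid is a hypothetical function name.

```shell
# Hypothetical helper: print each label followed by its sequence line,
# but only when the sequence contains a character outside A, C, G, T, N.
report_invalid() {
  awk '
    /^>/       { label = $0; next }   # remember the most recent label line
    /[^ACGTN]/ { print label; print } # sequence line with an invalid character
  ' "$1"
}

# Example usage (input_file is the placeholder name used above):
# report_invalid input_file
```

Unlike the character list in the grep command, the negated class [^ACGTN] also flags lowercase letters, digits and punctuation, so the labels do not need to be free of uppercase letters here.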

Use grep -v to remove the scaffold lines, then grep -oP to select the runs of undesired letters.

grep -v '^>' test.txt | grep -oP '[^ACGTN]+'

Output from the sample data:

S
YYYYYYYYYYYYYYYYY
SSSSSSSSSS
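
If a summary is more useful than the raw runs, the same pipeline could be adapted to tally how often each unexpected letter occurs. This is only a sketch; count_invalid is a hypothetical function name, and test.txt is the file name used above.

```shell
# Hypothetical helper: count how many times each character other than
# A, C, G, T, N occurs in the sequence lines of the given file.
count_invalid() {
  grep -v '^>' "$1" |      # drop the scaffold (label) lines
    grep -o '[^ACGTN]' |   # emit each unexpected character on its own line
    sort | uniq -c         # count the occurrences of each character
}

# Example usage:
# count_invalid test.txt
```

For the sample data this should list S and Y, each preceded by its count; the count for S includes the single S in scaffold1.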
