How to grep specific letters in a sequence using grep

I have a file containing this form of information:

>scaffold1|size69534
ACATAAGAGSTGATGATAGATAGATGCAGATGACAGATGANNGTGANNNNNNNNNNNNNTAGAT
>scaffold2|size68281
ATAGAGATGAGACAGATGACAGANNNNAGATAGATAGAGCAGATAGACANNNNAGATAGAG
>scaffold3|size67203
ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGACNNNNNNACATYYYYYYYYYYYYYYYYY
>scaffold4|size66423
ACAGATAGCAGATAGACAGATNNNNNNNAGATAGTAGACSSSSSSSSSS

and so on

I suspect there is something abnormal in the sequences, so I want to grep all the letters that are not A, C, T, G or N in the sequence lines (I want to search only the sequence lines, not the >scaffold|size header lines).
In the example above it should find the run of Ys after scaffold3 and the run of Ss in scaffold4.
I hope this is clear enough; please tell me if you need any clarification.

Thank you

Could you please try the following? It assumes you don't want empty lines in the output.

awk '!/^>/{gsub(/[ACTGN]/,"");if(NF){print}}'  Input_file

Explanation: a commented version of the above code.

awk '                    ##Start of the awk program.
!/^>/{                   ##If the line does not start with > then do the following.
  gsub(/[ACTGN]/,"")     ##Globally replace A, C, T, G and N with an empty string in the line.
  if(NF){                ##If the line is not empty after the substitution then do the following.
    print                ##Print what is left of the line.
  }
}
'  Input_file            ##Name of the input file.

Output will be as follows.

S
YYYYYYYYYYYYYYYYY
SSSSSSSSSS
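
If it would help to see which scaffold each leftover run came from, the same idea can be extended to remember the most recent header line. This is only a sketch: it assumes every sequence sits on a single line directly after its header, as in the sample, and both the file name sample.fa and the function name report_invalid are made up for illustration.

```shell
# Recreate the sample from the question (sample.fa is a hypothetical file name).
cat > sample.fa <<'EOF'
>scaffold1|size69534
ACATAAGAGSTGATGATAGATAGATGCAGATGACAGATGANNGTGANNNNNNNNNNNNNTAGAT
>scaffold2|size68281
ATAGAGATGAGACAGATGACAGANNNNAGATAGATAGAGCAGATAGACANNNNAGATAGAG
>scaffold3|size67203
ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGACNNNNNNACATYYYYYYYYYYYYYYYYY
>scaffold4|size66423
ACAGATAGCAGATAGACAGATNNNNNNNAGATAGTAGACSSSSSSSSSS
EOF

# report_invalid is a hypothetical helper: it prints "header<TAB>leftover"
# for every sequence line that contains letters other than A, C, G, T, N.
report_invalid() {
  awk '
  /^>/ { header = $0; next }        # header line: remember it, print nothing
  {
    gsub(/[ACGTN]/, "")             # delete the valid letters from the line
    if (length($0))                 # anything left over is invalid
      print header "\t" $0
  }' "$1"
}

report_invalid sample.fa
```

Scaffolds whose sequence contains only valid letters (scaffold2 in the sample) produce no output line at all.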

Let's assume you don't just need to know which sequences contain invalid characters - you also want to know which scaffold each sequence belongs to. This can be done; how to do it depends on the exact output format you need, and also on the exact structure of the data.

Just for illustration, I will make the following simplifying assumptions: the "sequences" may only contain uppercase letters (which may be the valid ones or invalid ones - but there can't be punctuation marks, or digits, etc.); and the labels (the rows that begin with a > ) don't contain any uppercase letters. Note - if the sequences only contain letters, then it's not too hard to pre-process the file to convert the sequences to all-uppercase and the labels to all-lowercase, so the solution below will still work.

In some versions of grep, the matched (invalid) characters appear in a different color in the output (see the linked image). I find this quite helpful.

grep --no-group-separator -B 1 '[BDEFHIJKLMOPQRSUVWXYZ]' input_file

OUTPUT:

>scaffold1|size69534
ACATAAGAGSTGATGATAGATAGATGCAGATGACAGATGANNGTGANNNNNNNNNNNNNTAGAT
>scaffold3|size67203
ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGACNNNNNNACATYYYYYYYYYYYYYYYYY
>scaffold4|size66423
ACAGATAGCAGATAGACAGATNNNNNNNAGATAGTAGACSSSSSSSSSS

[Image: grep output with the matched characters highlighted in color]
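
The pre-processing step mentioned above (labels to lowercase, sequences to uppercase) could look roughly like the following. It is a sketch under the same simplifying assumptions; the file mixed.fa, its contents and the function name normalize_case are invented for illustration, and --no-group-separator is a GNU grep option.

```shell
# Hypothetical input with mixed case in both the labels and the sequences.
cat > mixed.fa <<'EOF'
>Scaffold1|Size10
acgtnACGTN
>Scaffold2|Size10
acgtyACGTN
EOF

# normalize_case is a hypothetical helper: label lines (starting with >)
# are lowercased, every other line is uppercased.
normalize_case() {
  awk '/^>/ { print tolower($0); next } { print toupper($0) }' "$1"
}

# Pipe the normalized text into the same grep command, which now reads stdin.
normalize_case mixed.fa | grep --no-group-separator -B 1 '[BDEFHIJKLMOPQRSUVWXYZ]'
```

After normalization the labels no longer contain uppercase letters, so only sequence lines can match the bracket expression.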

Use grep -v to remove the scaffold lines, then use grep -oP to select the runs of undesired letters.

grep -v '^>' test.txt | grep -oP '[^ACGTN]+'

Output from the sample data:

S
YYYYYYYYYYYYYYYYY
SSSSSSSSSS
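
If a tally per letter is more useful than the raw runs, the same pipeline can be extended with sort and uniq -c. This is a sketch only: tiny.fa and count_invalid are made-up names, and the two short sequences are invented so the example stays small. Dropping the + makes grep -o print one character per line, and since the pattern is a plain bracket expression, -P is not needed here.

```shell
# Small made-up input (tiny.fa is a hypothetical file name).
cat > tiny.fa <<'EOF'
>s1
ACGSTN
>s2
ACYYGS
EOF

# count_invalid is a hypothetical helper: it drops the header lines, emits
# each invalid character on its own line, then counts how often each occurs.
count_invalid() {
  grep -v '^>' "$1" | grep -o '[^ACGTN]' | sort | uniq -c
}

count_invalid tiny.fa
```

Each output line holds a count followed by the letter it belongs to; the amount of leading whitespace in front of the count depends on the uniq implementation.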
