How to grep specific letters in a sequence using grep

Question

I have a file containing this form of information:

>scaffold1|size69534
ACATAAGAGSTGATGATAGATAGATGCAGATGACAGATGANNGTGANNNNNNNNNNNNNTAGAT
>scaffold2|size68281
ATAGAGATGAGACAGATGACAGANNNNAGATAGATAGAGCAGATAGACANNNNAGATAGAG
>scaffold3|size67203
ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGACNNNNNNACATYYYYYYYYYYYYYYYYY
>scaffold4|size66423
ACAGATAGCAGATAGACAGATNNNNNNNAGATAGTAGACSSSSSSSSSS

and so on

I suspect there is something abnormal in the sequences, so I want to grep all the letters that are not A, C, T, G or N, searching only the sequence lines (that is, skipping the header lines of the form >scaffold|size).
In the example above, it should find the run of Y characters in scaffold3 and the run of S characters in scaffold4.
I hope this is clear enough; please tell me if you need any clarification.

Thank you

Answer 1

Could you please try the following? It assumes you don't want empty lines in the output.

awk '!/^>/{gsub(/[ACTGN]/,"");if(NF){print}}'  Input_file

Explanation: a commented version of the code above.

awk '                    ##Start the awk program.
!/^>/{                   ##If the line does not start with >, do the following.
  gsub(/[ACTGN]/,"")     ##Globally replace A, C, T, G and N with an empty string in the line.
  if(NF){                ##If the line still has at least one field (is not empty) after the substitution, do the following.
    print                ##Print the current line.
  }
}
'  Input_file            ##Name of the input file.

Output will be as follows.

S
YYYYYYYYYYYYYYYYY
SSSSSSSSSS

Answer 2

Let's assume you don't just need to know which sequences contain invalid characters - you also want to know which scaffold each sequence belongs to. This can be done; how to do it depends on the exact output format you need, and also on the exact structure of the data.

Just for illustration, I will make the following simplifying assumptions: the "sequences" may only contain uppercase letters (which may be the valid ones or invalid ones - but there can't be punctuation marks, or digits, etc.); and the labels (the rows that begin with a > ) don't contain any uppercase letters. Note - if the sequences only contain letters, then it's not too hard to pre-process the file to convert the sequences to all-uppercase and the labels to all-lowercase, so the solution below will still work.
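The pre-processing step mentioned above could look roughly like this. It is only a sketch: the file names mixed_case.fa and normalized.fa and the sample record are made up for illustration, and it assumes every label line starts with > and every other line is sequence data.

```shell
# Hypothetical input: an uppercase-containing label and a mixed-case sequence.
printf '%s\n' '>Scaffold1|Size10' 'acgtnnACGTs' > mixed_case.fa

# Lowercase the label lines (those starting with ">"),
# uppercase every other line (the sequences).
awk '/^>/ { print tolower($0); next } { print toupper($0) }' mixed_case.fa > normalized.fa

cat normalized.fa
```

The normalized file can then be passed to the grep command below in place of input_file.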

In some versions of grep, the matched (invalid) characters are highlighted in a different color. I find this quite helpful.

grep --no-group-separator -B 1 '[BDEFHIJKLMOPQRSUVWXYZ]' input_file

OUTPUT:

>scaffold1|size69534
ACATAAGAGSTGATGATAGATAGATGCAGATGACAGATGANNGTGANNNNNNNNNNNNNTAGAT
>scaffold3|size67203
ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGACNNNNNNACATYYYYYYYYYYYYYYYYY
>scaffold4|size66423
ACAGATAGCAGATAGACAGATNNNNNNNAGATAGTAGACSSSSSSSSSS
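If a different output format is wanted, for example each scaffold name next to only its invalid characters instead of the whole sequence, one possible sketch uses awk rather than grep. The file names scaffolds_demo.fa and invalid_report.tsv are hypothetical, and the sequences are shortened versions of the sample data. Like Answer 1, it joins separate invalid runs on the same line into one string.

```shell
# Shortened, hypothetical version of the sample data.
cat > scaffolds_demo.fa <<'EOF'
>scaffold1|size69534
ACATAAGAGSTGATNNGTGA
>scaffold2|size68281
ATAGAGNNNNAGATAGAG
>scaffold3|size67203
ATAGACNNNNACATYYYY
EOF

# Remember the most recent header line. For each sequence line, delete the
# valid letters from a copy; if anything is left, print header and leftovers.
awk '
/^>/ { header = $0; next }
{
    invalid = $0
    gsub(/[ACGTN]/, "", invalid)
    if (invalid != "") print header "\t" invalid
}' scaffolds_demo.fa > invalid_report.tsv

cat invalid_report.tsv
```

With this input, the report has one tab-separated line for scaffold1 (S) and one for scaffold3 (YYYY); scaffold2 is skipped because it contains only valid letters.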

Answer 3

Use grep -v to remove the scaffold lines, and grep -oP to select the runs of undesired letters.

grep -v '^>' test.txt | grep -oP '[^ACGTN]+'

Output from the sample data:

S
YYYYYYYYYYYYYYYYY
SSSSSSSSSS
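A small variation on the same pipeline, sketched here as an assumption about what might be useful next, tallies how often each unexpected letter occurs across the whole file. The file names tally_demo.fa and invalid_counts.tsv are hypothetical, and the sequences are shortened versions of the sample data.

```shell
# Shortened, hypothetical version of the sample data.
cat > tally_demo.fa <<'EOF'
>scaffold1|size69534
ACATAAGAGSTGATNNGTGA
>scaffold2|size68281
ATAGAGNNNNAGATAGAG
>scaffold3|size67203
ATAGACNNNNACATYYYY
EOF

# Drop the header lines, emit each invalid character on its own line
# (grep -o with a single-character class), then count identical lines.
# The final awk reorders the uniq -c output to "letter<TAB>count".
grep -v '^>' tally_demo.fa \
  | grep -o '[^ACGTN]' \
  | sort \
  | uniq -c \
  | awk '{ print $2 "\t" $1 }' > invalid_counts.tsv

cat invalid_counts.tsv
```

Here the pattern has no trailing +, so every character is matched separately and a run such as YYYY contributes four lines to the count.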


solution1
2 2020-04-19 14:30:09

solution2
1 ACCEPTED 2020-04-20 14:40:03

solution3
0 2020-04-19 14:17:38
