I have a file containing this form of information:
>scaffold1|size69534
ACATAAGAGSTGATGATAGATAGATGCAGATGACAGATGANNGTGANNNNNNNNNNNNNTAGAT
>scaffold2|size68281
ATAGAGATGAGACAGATGACAGANNNNAGATAGATAGAGCAGATAGACANNNNAGATAGAG
>scaffold3|size67203
ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGACNNNNNNACATYYYYYYYYYYYYYYYYY
>scaffold4|size66423
ACAGATAGCAGATAGACAGATNNNNNNNAGATAGTAGACSSSSSSSSSS
and so on
But I suspect there is something abnormal in the sequences, so what I want to do is grep all the letters that are not A, C, T, G or N in the lines after each scaffold header (I want to search only the sequence lines, not the >scaffold|size lines).
In the example above it would grep YYYYYYYYYYYYYYYYY after scaffold3 and SSSSSSSSSS in scaffold4.
I hope I'm clear enough; please tell me if you need any clarification.
Thank you
Could you please try the following? It assumes that you don't want empty lines in the output.
awk '!/^>/{gsub(/[ACTGN]/,"");if(NF){print}}' Input_file
Explanation: a detailed explanation of the above code.
awk ' ##Start of the awk program.
!/^>/{ ##If the line does not start with >, do the following.
gsub(/[ACTGN]/,"") ##Globally replace A, C, T, G and N with an empty string in the line.
if(NF){ ##If the line is not empty after the substitution (NF is non-zero), do the following.
print ##Print the current line.
}
}
' Input_file ##Input_file is the name of the input file.
Output will be as follows.
S
YYYYYYYYYYYYYYYYY
SSSSSSSSSS
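If you also want to see which scaffold each leftover belongs to, the same idea can be extended by remembering the most recent header line. This is only a sketch: the toy data, the temporary file and the tab-separated output format are my own assumptions, not something from your file.

```shell
# Toy input (made-up data, much shorter than real scaffolds).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>scaffold1|size10
ACGTSACGTN
>scaffold2|size8
ACGTNNAC
>scaffold3|size9
ACGTYYYAC
EOF

result=$(awk '
/^>/ { header = $0; next }     # remember the latest header, do not print it yet
{
  gsub(/[ACGTN]/, "")          # strip the valid letters from the sequence line
  if ($0 != "")                # anything left over is an invalid letter
    print header "\t" $0
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

As with the one-liner above, separate runs of invalid letters on the same line are printed joined together, because only the valid letters are deleted.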
Let's assume you don't just need to know which sequences contain invalid characters - you also want to know which scaffold each sequence belongs to. This can be done; how to do it depends on the exact output format you need, and also on the exact structure of the data.
Just for illustration, I will make the following simplifying assumptions: the "sequences" may only contain uppercase letters (valid or invalid ones, but no punctuation marks, digits, etc.); and the labels (the rows that begin with a >) don't contain any uppercase letters. Note: if the sequences only contain letters, then it's not too hard to pre-process the file to convert the sequences to all-uppercase and the labels to all-lowercase, so that the solution below will still work.
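A minimal sketch of that pre-processing step, using awk's tolower and toupper functions. The toy data and the temporary file are assumptions made up for the illustration.

```shell
# Toy input with mixed case in both labels and sequences (made-up data).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>Scaffold1|Size4
acgt
>SCAFFOLD2|size5
AcgTy
EOF

result=$(awk '
/^>/ { print tolower($0); next }   # label lines become all-lowercase
     { print toupper($0) }         # sequence lines become all-uppercase
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

The normalized output can be written to a new file and then searched with the grep command shown next.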
In some versions of grep, the invalid characters will appear in a different color in the output. I find this quite helpful.
grep --no-group-separator -B 1 '[BDEFHIJKLMOPQRSUVWXYZ]' input_file
OUTPUT:
>scaffold1|size69534
ACATAAGAGSTGATGATAGATAGATGCAGATGACAGATGANNGTGANNNNNNNNNNNNNTAGAT
>scaffold3|size67203
ATAGAGTAGAGAGAGAGTACAGATAGAGGAGAGAGATAGACNNNNNNACATYYYYYYYYYYYYYYYYY
>scaffold4|size66423
ACAGATAGCAGATAGACAGATNNNNNNNAGATAGTAGACSSSSSSSSSS
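If the assumptions above don't hold, an awk version may be more forgiving. With -B 1, grep shows the line just before each match, which is the label only when every sequence sits on a single line; and the character class would also match uppercase letters inside a label. The sketch below remembers the current label instead. The toy data (with a capitalized label and a wrapped sequence) and the temporary file are made up for the illustration.

```shell
# Toy input: labels contain an uppercase letter, first sequence is wrapped (made-up data).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>Scaffold1|size12
ACGTAC
GTSACG
>Scaffold2|size6
ACGTNN
>Scaffold3|size8
ACYYGTAC
EOF

result=$(awk '
/^>/ { header = $0; shown = 0; next }          # new record: remember its label
/[^ACGTN]/ {                                   # sequence line with an invalid character
  if (!shown) { print header; shown = 1 }      # print the label once per record
  print
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Here only the offending sequence lines are printed under their label, so a long wrapped scaffold does not flood the output.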
Use grep -v to remove the scaffold lines, and use grep -oP to select the segments of undesired letters.
grep -v '^>' test.txt | grep -oP '[^ACGTN]+'
Output from the sample data:
S
YYYYYYYYYYYYYYYYY
SSSSSSSSSS
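If you only need a summary of which undesired letters occur and how often, the same pipeline can match one character at a time and feed the result to sort and uniq -c. This is a sketch on made-up toy data; the final awk step is only there to tidy the padding that uniq -c adds, and the letter/count output format is my own choice.

```shell
# Toy input (made-up data).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>scaffold1|size10
ACGTSACGTN
>scaffold2|size8
ACGTNNAC
>scaffold3|size9
ACGTYYYAC
EOF

# Drop the label lines, emit each invalid character on its own line, then tally them.
result=$(grep -v '^>' "$tmp" | grep -o '[^ACGTN]' | sort | uniq -c | awk '{ print $2 "\t" $1 }')

printf '%s\n' "$result"
rm -f "$tmp"
```

Each output line is an invalid letter followed by the number of times it appears in the sequence lines.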