Remove multiple sequences from fasta file

I have a text file of character sequences in which each record consists of two lines: a header, and the sequence itself on the following line. The structure of the file is as follows:

>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

In another file I have a list of headers of the sequences that I would like to remove, like this:

>header1
>header5
>header12
[...]
>header145

The idea is to remove these sequences from the first file, that is, each of these headers plus the line that follows it. I did it using sed, like this:

while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt

It works but takes quite a long time, since sed rereads the whole file once per header, and the file is quite big. Any idea how I could speed up this process?

$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001 .

Alternatively:

$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

f is whether or not the most recently read >... line was found in the target array a[] . f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.

The first script relies on you knowing how many lines long each record is, while the second relies on every record starting with > . If you know both, then which one you use is a style choice.
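To make the difference concrete, here is a small sketch that runs both scripts on the same input. The sample data is hypothetical: the sequence for >header2 is deliberately wrapped over two lines, which breaks the two-lines-per-record assumption of the counter version.

```shell
# Scratch directory so that no real files are touched.
tmp=$(mktemp -d)

# Hypothetical sample data: the header2 sequence spans two lines.
printf '%s\n' '>header1' 'aaaa' '>header2' 'bbbb' 'bbbb' '>header3' 'cccc' > "$tmp/file"
printf '%s\n' '>header2' > "$tmp/list"

# Counter version: drops the matched header plus exactly one following line.
by_count=$(awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' "$tmp/list" "$tmp/file")

# Flag version: drops everything up to the next line that starts with ">".
by_flag=$(awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' "$tmp/list" "$tmp/file")

printf 'counter version:\n%s\n\nflag version:\n%s\n' "$by_count" "$by_flag"
```

With this input the counter version leaves the second bbbb line behind under >header1, while the flag version removes the whole record.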

You can use the following awk:

awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt

Your question is easy to answer, but the answer will not help you when you handle generic FASTA files. A FASTA file has a sequence header followed by one or more lines which are concatenated to form the sequence. The FASTA format roughly obeys the following rules:

  • The description line (defline) or header/identifier line, which begins with the greater-than character ( > ), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
  • Following the description line is the actual sequence itself as a standard one-letter character string. Anything other than a valid character is ignored (including spaces, tabs, asterisks, etc.).
  • The sequence can span multiple lines.
  • A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.
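As an illustration of these rules, the sketch below joins the wrapped lines of each sequence so that every record ends up as exactly two lines, the layout the question assumes. The file name wrapped.fasta and its contents are made up for the example, and the sketch only skips empty lines; it does not strip other invalid characters.

```shell
# Scratch directory so that no real files are touched.
tmp=$(mktemp -d)

# Hypothetical multi-FASTA: seq1 is wrapped over two lines, and an empty
# line separates the two records.
printf '%s\n' '>seq1 first' 'aaaa' 'bbbb' '' '>seq2' 'cccc' > "$tmp/wrapped.fasta"

linear=$(awk '
    # header line: flush the previous sequence, then print the header
    /^>/ { if (seq != "") print seq; print; seq = ""; next }
    # non-empty sequence line: append it to the current sequence
    NF   { seq = seq $0 }
    # end of input: flush the last sequence
    END  { if (seq != "") print seq }
' "$tmp/wrapped.fasta")

printf '%s\n' "$linear"
```

After this step, the two-line approaches from the other answers apply to the linearized output.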

Most of the methods presented here will fail on a multi-FASTA file with multi-line sequences.

The following also handles multi-line sequences:

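awk '(NR==FNR) { toRemove[$1]; next }
     # on each header line, drop the record if the line matches any entry
     # (each entry h is used as a regular expression)
     /^>/ { p=1; for(h in toRemove) if ( $0 ~ h ) p=0 }
    p' headers.txt file.fasta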

This is very similar to the answers of Ed Morton and anubhava, but the difference here is that headers.txt may contain only part of each header.
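If the partial header is always the identifier, i.e. the first whitespace-delimited word of the header line (an assumption, not something the question states), an exact lookup on $1 avoids looping over every entry and does not confuse >header1 with >header12. A minimal sketch with made-up sample data:

```shell
# Scratch directory so that no real files are touched.
tmp=$(mktemp -d)

# Hypothetical data: headers carry a description after the identifier,
# and the header1 sequence spans two lines.
printf '%s\n' '>header1 sample A' 'aaaa' 'aaaa' \
              '>header2 sample B' 'bbbb' \
              '>header12 sample C' 'cccc' > "$tmp/file.fasta"
printf '%s\n' '>header1' > "$tmp/headers.txt"

kept=$(awk '
    # first file: remember each identifier to remove
    NR == FNR { toRemove[$1]; next }
    # header line: print this record only if its identifier is not listed
    /^>/      { p = !($1 in toRemove) }
    p
' "$tmp/headers.txt" "$tmp/file.fasta")

printf '%s\n' "$kept"
```

Here >header1 and both of its sequence lines are dropped, while >header12 is kept because the comparison is on the whole identifier rather than on a substring.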

One option is to create a long sed expression:

sedcmd=
while IFS= read -r line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed "$sedcmd" first_file.txt

This will only read the file once. Note that I added the ^ and $ to the sed pattern (so >header1 doesn't match >header123 ...)

Using a file (as @daniu suggests) might be better if you have thousands of headers, as you risk exceeding the maximum command-line length with this method.
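To get a rough idea of whether the limit matters, you can compare the length of the generated sed script with the system's ARG_MAX value. This is only a sketch with a made-up three-line header list; ARG_MAX covers the arguments and the environment together, and some systems impose a smaller limit on a single argument, so treat the comparison as an estimate.

```shell
# Scratch directory so that no real files are touched.
tmp=$(mktemp -d)

# Hypothetical header list standing in for second_file.txt.
printf '%s\n' '>header1' '>header5' '>header12' > "$tmp/second_file.txt"

# Build the same kind of sed script as above (written without += so it
# also runs in a plain POSIX shell).
sedcmd=
while IFS= read -r line; do
    sedcmd="${sedcmd}/^${line}\$/,+1d;"
done < "$tmp/second_file.txt"

cmd_len=${#sedcmd}            # length of the sed script in characters
arg_max=$(getconf ARG_MAX)    # system limit for arguments plus environment

printf 'sed script: %s characters; ARG_MAX: %s bytes\n' "$cmd_len" "$arg_max"
```

If the script length gets anywhere near the limit, writing the commands to a file and using sed -f is the safer route.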

Create a script with the delete commands from the second file:

sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed

Then apply that script to the first file:

sed -f commands.sed firstFile.txt 

This awk may help you:

awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1

Try GNU sed:

sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f -  first_file.txt

Prepend the time command to both scripts to compare their speed, i.e. time while read line;do... and time sed -.... In my test this finished in less than half the time of the OP's loop.

This can easily be done with BBTools. The seqs2remove.txt file should contain one header per line, exactly as they appear in the large.fasta file.

filterbyname.sh in=large.fasta out=kept.fasta names=seqs2remove.txt
