Remove multiple sequences from fasta file

Question

I have a text file of character sequences, each consisting of two lines: a header, and the sequence itself on the following line. The structure of the file is as follows:

>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

In another file I have a list of headers of the sequences that I would like to remove, like this:

>header1
>header5
>header12
[...]
>header145

The idea is to remove these sequences from the first file, that is, each of these headers plus the line that follows it. I did it using sed as follows:

while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt

It works, but it takes quite long because sed reads the whole file once per header, and the file is quite big. Any idea how I could speed up this process?

Answer 1

$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001 .

Alternatively:

$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

f is whether or not the most recently read >... line was found in the target array a[] . f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.

The first script relies on you knowing how many lines long each record is, while the second relies on every record starting with > . If you know both, then which one you use is a style choice.
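To make the difference between the two scripts concrete, here is a small sketch that runs both on a record spanning more than two lines. The sample data and the temporary file names are made up for illustration; only the two awk programs come from the answer above.

```shell
# Hypothetical sample data: each record has a header plus TWO sequence lines.
tmp=$(mktemp -d)
printf '%s\n' '>header1' 'aaaa' 'aaaa' '>header2' 'bbbb' 'bbbb' > "$tmp/file"
printf '%s\n' '>header1' > "$tmp/list"

# Script 1: skip a fixed count of 2 lines starting at each listed header.
by_count=$(awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' "$tmp/list" "$tmp/file")

# Script 2: skip every line until the next ">" line that is not listed.
by_flag=$(awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' "$tmp/list" "$tmp/file")

printf 'fixed count:\n%s\n\nheader flag:\n%s\n' "$by_count" "$by_flag"
rm -rf "$tmp"
```

With this input the fixed-count script leaves the second aaaa line behind, while the flag-based script drops the whole record.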

Answer 2

You can use the following awk:

awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt

Answer 3

Your question is easy to answer, but the answer will not help you when you handle generic FASTA files. FASTA files have a sequence header followed by one or more lines that are concatenated to represent the sequence. The FASTA file format roughly obeys the following rules:

The description line (defline) or header/identifier line, which begins with the greater-than character ( > ), gives a name and/or a unique identifier for the sequence, and may also contain additional information.

Following the description line is the actual sequence itself as a standard one-letter character string. Anything other than a valid character is ignored (including spaces, tabs, asterisks, etc.).

The sequence can span multiple lines.

A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.

Most of the presented methods will fail on a multi-FASTA file with multi-line sequences.

The following works regardless of how many lines each sequence spans:

awk '(NR==FNR) { toRemove[$1]; next }
     /^>/ { p=1; for(h in toRemove) if ( $0 ~ h) p=0 }
    p' headers.txt file.fasta

This is very similar to the answers of Ed Morton and anubhava, but the difference here is that the file headers.txt may contain only a part of each header.
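As a sketch of the partial-header idea on a multi-line FASTA file, the variant below uses awk's index() so that each entry in the list is treated as a literal substring and not as a regular expression. That choice is an assumption of this example, not part of the answer above, and the sample headers and file contents are made up.

```shell
# Hypothetical multi-line FASTA with an empty line between records.
tmp=$(mktemp -d)
printf '%s\n' '>seq1 sample A' 'acgt' 'acgt' '' '>seq2 sample B' 'ttgg' 'ttgg' > "$tmp/file.fasta"
# The list holds only a part of the header to remove.
printf '%s\n' 'seq1' > "$tmp/headers.txt"

kept=$(awk 'NR==FNR { if ($0 != "") toRemove[$0]; next }   # load partial headers
     /^>/ { p=1; for (h in toRemove) if (index($0, h)) { p=0; break } }
     p' "$tmp/headers.txt" "$tmp/file.fasta")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

Note that substring matching is deliberately loose: an entry such as seq1 would also remove a record named seq10, so the list entries need to be specific enough for your data.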

Answer 4

One option is to create a long sed expression:

sedcmd=
while IFS= read -r line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed "$sedcmd" first_file.txt

This will only read the file once. Note that I added the ^ and $ to the sed pattern (so >header1 doesn't match >header123 ...)

Using a script file (as @daniu suggests) might be better if you have thousands of headers, as you risk hitting the maximum command-line length with this method.
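The effect of the ^ and $ anchors can be seen in a small sketch. The sample file is made up, and the addr,+1 address form is a GNU sed extension, so this assumes GNU sed.

```shell
# Hypothetical sample: >header12 shares a prefix with >header1.
tmp=$(mktemp -d)
printf '%s\n' '>header1' 'aaaa' '>header12' 'bbbb' '>header2' 'cccc' > "$tmp/first_file.txt"

# Unanchored: the pattern matches any line that contains ">header1".
unanchored=$(sed '/>header1/,+1d' "$tmp/first_file.txt")

# Anchored: the pattern must match the whole header line.
anchored=$(sed '/^>header1$/,+1d' "$tmp/first_file.txt")

printf 'unanchored:\n%s\n\nanchored:\n%s\n' "$unanchored" "$anchored"
rm -rf "$tmp"
```

Without the anchors the >header12 record is deleted as well; with them only >header1 and its sequence line are removed.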

Answer 5

Create a script with the delete commands from the second file:

sed 's#.*#/^&$/,+1d#' secondFile.txt > commands.sed

Then apply that script to the first file:

sed -f commands.sed firstFile.txt

Answer 6

This awk may work for you:

awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1

Answer 7

Try GNU sed:

sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f -  first_file.txt

Prepend the time command to both scripts to compare their speed, that is, compare time while read line; do ... with time sed -... . In my test, this finished in less than half the time of the OP's loop.

Answer 8

This can easily be done with BBTools. The seqs2remove.txt file should contain one header per line, exactly as they appear in the large.fasta file.

filterbyname.sh in=large.fasta out=kept.fasta names=seqs2remove.txt

solution1
1 2019-04-11 15:49:15

solution2
1 2019-04-11 15:54:56

solution3
1 2019-04-11 16:29:47

solution4
0 2019-04-11 15:38:35

solution5
0 ACCEPTED 2019-04-11 15:41:13

solution6
0 2019-04-11 15:43:27

solution7
0 2019-04-12 14:24:17

solution8
0 2022-11-22 02:35:52
