How to extract specific parts of a reference text, based on a list of identifiers?

I have a reference file (.fasta) and a list of gene IDs. For each ID in the gene ID list, I need to get the corresponding sequence into a text file. How can I automate this?

Things I've tried so far:

  1. sed

sed -n -e '/{GENEID1}/,/>/p' referencefile.fasta | sed $d >> seqs.txt

with '>' being the character at which I'd like sed to stop. I need the second sed to remove the last line, because the first sed also grabs the header line of the next sequence. This works if I just run it once, but if I try

cat geneID.txt | xargs sed -n -e '/{}/,/>/p' referencefile.fasta >> seqs.txt

then I get just a list of IDs, with no sequences. It also takes super long, so I assume sed is reading through the reference file, but I don't see why it won't grab the sequences?

  2. grep

grep -o -P '(?={GENEID}).*(?=>)

Here I have the same issue - works individually, but not with xargs or a loop.

  3. using a for loop

     for LINE in $(cat geneIDs.txt); do
         echo $LINE >> seqs.txt
         sed -n -e '/$LINE/,/>/p' referencefile.fasta | sed $d >> seqs.txt
     done

I'm also open to trying something in Python, though I'm not that well-versed in it yet. My preliminary attempt has been based on this question here. I have a test ID list of 10 lines, which I tried to run like this:

t = open('test.txt', 'r')
test = t.readlines()
test = test.split()

with open('referencefile.fasta', 'r') as ref:
    for line in ref:
        for i in test:
            if i in line:

This one, I couldn't even get a sequence from the reference file, regardless of the loop.

Can you guys spot the issue? Why won't any of these give me sequences?

Thanks in advance!

Edited to add:

Example reference:





test IDs: 000000F, 000001F

Ideal result:

000000F ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg


Current result:

000000F 000001F

If there is always a single line after one geneId in your fasta file, this will help:

grep -A1 -Fwf geneIds.txt input.fasta

Check this example:

$  head -n 20 *
==> ids.txt <==

==> input.fasta <==
Yes I want it!

Yes I want it too!



kent$  grep -A1 -Fwf ids.txt input.fasta
Yes I want it!
Yes I want it too!
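
If a sequence can wrap over several lines, `grep -A1` would only return its first line. Since the question mentions being open to Python, here is a minimal sketch of a streaming filter that keeps every line from a matching header up to the next header. It assumes the ID is the first whitespace-delimited token after `>`; the function name `extract_records` and the toy data are made up for illustration and are not from the original files.

```python
def extract_records(fasta_lines, wanted_ids):
    """Yield header and sequence lines for records whose ID is in wanted_ids."""
    wanted = set(wanted_ids)  # set membership is O(1) per header
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        # Skip blank separator lines; emit everything else in a wanted record.
        if keep and line.strip():
            yield line.rstrip("\n")


# Made-up toy data, only to show the call pattern.
demo_fasta = [
    ">000000F\n",
    "ctatcttcgagg\n",
    "ttgccacctgta\n",
    ">000001F\n",
    "acgtacgt\n",
    ">000002F\n",
    "ggggcccc\n",
]
for out_line in extract_records(demo_fasta, ["000000F", "000002F"]):
    print(out_line)
```

Because the reference file is read only once regardless of how many IDs are requested, this should also avoid the slowdown of rescanning the file for every ID. For real files, pass an open file object instead of the list.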

Depending on the size of the file, your access patterns, and what else you may use the sequences for, it may be easiest to just build a BLAST database and then feed it your identifiers. It will return exactly what you are asking for (except as correctly formatted FASTA).

Pros: it is well designed, tested, and fast.

Cons: it may be overkill for your task

(but still super useful if you will be continuing to work in this space).
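
The answer above does not name the commands, so the following is only a sketch under the assumption that the NCBI BLAST+ tools `makeblastdb` and `blastdbcmd` are installed and on the PATH. The function names, the database name `refdb`, and the file names are hypothetical. The database is built with `-parse_seqids` so that records can later be fetched by ID with `-entry_batch`, which takes a file with one ID per line.

```python
import subprocess


def makeblastdb_command(fasta_path, db_name):
    """Command that builds a nucleotide BLAST database with parsed sequence IDs."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl",
            "-parse_seqids", "-out", db_name]


def blastdbcmd_command(db_name, ids_path, out_path):
    """Command that fetches every ID listed in ids_path and writes FASTA to out_path."""
    return ["blastdbcmd", "-db", db_name, "-entry_batch", ids_path,
            "-out", out_path]


def extract_with_blast(fasta_path, ids_path, out_path, db_name="refdb"):
    """Build the database, then extract the listed sequences.

    Requires BLAST+ to be installed; raises CalledProcessError if a tool fails.
    """
    subprocess.run(makeblastdb_command(fasta_path, db_name), check=True)
    subprocess.run(blastdbcmd_command(db_name, ids_path, out_path), check=True)
```

A call such as `extract_with_blast("referencefile.fasta", "geneIDs.txt", "seqs.fasta")` would then run both steps. The database only needs to be built once, so for repeated lookups only the second command has to be rerun.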



$ cat file


With awk, you can read records separated by blank lines (two or more consecutive \n) in paragraph mode, which is enabled by setting RS to the empty string. This allows you to easily build an associative database of a file in that format.

Example, search by exact string:

awk -v RS= -v FS="\n" -v q=">000000F" '$1==q{print $2}' file

Or search by regex:

awk -v RS= -v FS="\n" -v q="[01]F$" '$1~q {print $2}' file

Or, build an associative array:

awk -v RS= -v FS="\n"   '{arr[$1]=$2} END{ "do something with the data in arr" }' file

Which you could use to print from a file with a list of ids:

$ cat ids

awk -v RS= -v FS="\n"  'FNR==NR{for(i=1; i<=NF; i++) ids[$i]; next}
$1 in ids{print $2}' ids file
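
The same associative-array idea can be sketched in Python with a dict, for readers who prefer that route. This is an illustration rather than a translation of the awk above: it assumes the ID is the first token after `>`, and unlike `$2` in the awk commands it joins sequences that wrap over several lines. The names `read_fasta` and `lookup` and the toy data are made up.

```python
def read_fasta(lines):
    """Return a dict mapping each record ID to its full sequence."""
    records = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines between records carry no data
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            records[current_id] = ""
        elif current_id is not None:
            # Append wrapped sequence lines to the current record.
            records[current_id] += line
    return records


def lookup(records, ids):
    """Return (id, sequence) pairs for requested IDs that exist, in request order."""
    return [(i, records[i]) for i in ids if i in records]


# Made-up toy data, only to show the call pattern.
toy = [">000000F\n", "ctatcttc\n", "gaggttgc\n", "\n", ">000001F\n", "acgtacgt\n"]
for seq_id, seq in lookup(read_fasta(toy), ["000001F", "000000F", "missing"]):
    print(">" + seq_id)
    print(seq)
```

Holding the whole reference in a dict trades memory for fast repeated lookups, which is the same trade-off the awk array makes.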

