
How to extract specific parts of a reference text, based on a list of identifiers?

I have a reference file (.fasta) and a list of gene IDs. For each ID in the gene ID list, I need to get the corresponding sequence into a text file. How can I automate this?

Things I've tried so far:

  1. sed

sed -n -e '/{GENEID1}/,/>/p' referencefile.fasta | sed '$d' >> seqs.txt

where '>' is the character at which I'd like sed to stop. The second sed removes the last line, because the first sed also grabs the header line of the next sequence. This works if I run it once, but if I try

cat geneID.txt | xargs sed -n -e '/{}/,/>/p' referencefile.fasta >> seqs.txt

then I get just a list of IDs, with no sequences. It also takes super long, so I assume sed is reading through the reference file, but I don't see why it won't grab the sequences?

  2. grep

grep -o -P '(?={GENEID}).*(?=>)'

Here I have the same issue - works individually, but not with xargs or a loop.

  3. using a for loop

     for LINE in $(cat geneIDs.txt); do
         echo $LINE >> seqs.txt
         sed -n -e '/$LINE/,/>/p' referencefile.fasta | sed '$d' >> seqs.txt
     done

I'm also open to trying something in Python, though I'm not that well-versed in it yet. My preliminary attempt is based on this question here. I have a test ID list of 10 lines, which I tried to run like this:

t = open('test.txt', 'r')
test = t.readlines()
test = test.split()
t.close()

with open('referencefile.fasta', 'r') as ref:
    for line in ref:
        for i in test:
            if i in line:
                print(line)

With this one, I couldn't get a single sequence from the reference file, with or without the loop.

Can you guys spot the issue? Why won't any of these give me sequences?

Thanks in advance!

Edited to add:

Example reference:

>000000F
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg


>000001F
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

>000002F

TGCGTGAGGTGCTAGGGATGACAATTGAAAAGAGGACATTGATCGATCACTTGACTCATTTCAGAAAGGAGTTTGGGTTGTCCAACAAGTTGAGGGGGATGATCATCAGGCATCCTGAGT

test IDs: 000000F, 000001F

Ideal result:

000000F ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg

000001F NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Current result:

000000F 000001F
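
On the Python attempt: readlines() returns a list, and a list has no split() method, so `test = test.split()` raises an AttributeError before the reference file is even opened. Even without that line, print(line) would only print the matching header lines, not the sequences under them. Below is a minimal sketch of one way around both problems. The function names read_ids and extract are made up for illustration, and the sketch assumes each ID is the first word after '>' on a header line.

```python
def read_ids(path):
    """Return the set of IDs found in a whitespace-separated ID file."""
    with open(path) as handle:
        # read().split() splits on any whitespace, including newlines
        return set(handle.read().split())


def extract(ids, fasta_lines):
    """Yield 'ID sequence' strings for each record whose ID is in ids."""
    current_id, chunks = None, []
    for raw in fasta_lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new header closes the previous record
            if current_id in ids:
                yield f"{current_id} {''.join(chunks)}"
            fields = line[1:].split()
            current_id = fields[0] if fields else None
            chunks = []
        elif line:
            # Sequence lines (possibly several per record) are joined
            chunks.append(line)
    # Do not forget the last record in the file
    if current_id in ids:
        yield f"{current_id} {''.join(chunks)}"


# Small in-memory demo shaped like the example reference (shortened sequences)
sample = [">000000F\n", "ctatcttcg\n", "\n", "\n",
          ">000001F\n", "NNNNNNNN\n", "\n",
          ">000002F\n", "\n", "TGCGTGAGG\n"]
for record in extract({"000000F", "000001F"}, sample):
    print(record)
```

With real files, something like `extract(read_ids('geneIDs.txt'), open('referencefile.fasta'))` should work the same way, since an open file iterates line by line. Keeping the IDs in a set and reading the reference once avoids rescanning the whole FASTA file for every ID.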

If each gene ID in your FASTA file is always followed by exactly one sequence line, this will help:

grep -A1 -Fwf geneIds.txt input.fasta

Check this example:

$ head -n 20 *
==> ids.txt <==
000000F
000001F

==> input.fasta <==
>000000F
Yes I want it!


>000001F
Yes I want it too!

>000002F
skip

>00000XYZ
skip

$ grep -A1 -Fwf ids.txt input.fasta
>000000F
Yes I want it!
--
>000001F
Yes I want it too!
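
The grep output keeps the '>' headers and the `--` group separators, while the ideal result in the question has the ID and the sequence on one line. A small post-processing sketch, assuming the one-sequence-line-per-ID layout above (join_records is a hypothetical helper name, not part of any tool):

```python
def join_records(grep_output):
    """Turn '>ID' / sequence line pairs into 'ID sequence' strings."""
    results = []
    current_id = None
    for line in grep_output.splitlines():
        line = line.strip()
        if not line or line == "--":
            continue  # skip blank lines and grep's group separator
        if line.startswith(">"):
            current_id = line[1:]
        elif current_id is not None:
            # First non-header line after a header is taken as its sequence
            results.append(f"{current_id} {line}")
            current_id = None
    return results


grep_output = """>000000F
Yes I want it!
--
>000001F
Yes I want it too!
"""
for record in join_records(grep_output):
    print(record)
```

In practice the text could come from piping the grep command into a script that reads sys.stdin.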

Depending on the size of the file, your access patterns, and what else you plan to do with the sequences, it may be easiest to build a BLAST database and feed it your identifiers. It will return exactly what you are asking for (except as correctly formatted FASTA).

Pros: it is well designed, tested, and fast.

Cons: it may be overkill for your task

(but still very useful if you will be continuing to work in this space).

https://duckduckgo.com/?q=build+a+blast+database&ia=web

Given:

$ cat file
>000000F
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg


>000001F
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

With awk you can read data separated by two or more \n in paragraph mode (RS set to the empty string). This allows you to easily build an associative lookup from a file in that format.

Example, search by exact string:

awk -v RS= -v FS="\n" -v q=">000000F" '$1==q{print $2}' file
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg

Or search by regex:

awk -v RS= -v FS="\n" -v q="[01]F$" '$1~q {print $2}' file
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Or, build an associative array:

awk -v RS= -v FS="\n"   '{arr[$1]=$2} END{ "do something with the data in arr" }' file

Which you could use to print from a file with a list of ids:

$ cat ids
>000001F
>000000F

awk -v RS= -v FS="\n"  'FNR==NR{for(i=1; i<=NF; i++) ids[$i]; next}
$1 in ids{print $2}' ids file
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
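
For comparison with the Python attempt in the question, the same paragraph-mode idea can be sketched in Python. This is only a rough analogue of the awk commands above, not a replacement for them: load_records is a hypothetical name, the regex treats whitespace-only lines as blank, and like `$2` in awk it keeps just the first line after each header.

```python
import re


def load_records(text):
    """Map each header line (including '>') to the line that follows it."""
    records = {}
    # Roughly like awk's RS="": records are separated by blank lines
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = block.split("\n")  # like FS="\n"
        if len(fields) >= 2:
            records[fields[0]] = fields[1]
    return records


fasta_text = """>000000F
ctatcttcg


>000001F
NNNNNNNN
"""
records = load_records(fasta_text)

# IDs written with the leading '>' as in the ids file above
for seq_id in [">000001F", ">000000F"]:
    if seq_id in records:
        print(records[seq_id])
```

Unlike the awk version, which prints matches in the order they appear in the FASTA file, this loop prints them in the order of the ID list. A record whose header is separated from its sequence by a blank line is split into two blocks and skipped.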
