I have a reference file (.fasta) and a list of gene IDs. For each ID in the gene ID list, I need to get the corresponding sequence into a text file. How can I automate this?
Things I've tried so far:
sed -n -e '/{GENEID1}/,/>/p' referencefile.fasta | sed '$d' >> seqs.txt
with '>' being the character at which I'd like sed to stop. I need the second sed to remove the last line, which grabs the first line of the next sequence, too. This works if I just run it once, but if I try
cat geneID.txt | xargs sed -n -e '/{}/,/>/p' referencefile.fasta >> seqs.txt
then I get just a list of IDs, with no sequences. It also takes super long, so I assume sed is reading through the reference file, but I don't see why it won't grab the sequences?
grep -o -P '(?={GENEID}).*(?=>)'
Here I have the same issue - works individually, but not with xargs or a loop.
Using a for loop:
for LINE in $(cat geneIDs.txt); do
    echo $LINE >> seqs.txt
    sed -n -e '/$LINE/,/>/p' referencefile.fasta | sed '$d' >> seqs.txt
done
I'm also open to trying something in Python, though I'm not that well-versed in it yet. My preliminary attempt has been based on this question here. I have a test ID list of 10 lines, which I tried to run like this:
t = open('test.txt', 'r')
test = t.readlines()
test = test.split()
t.close()
with open('referencefile.fasta', 'r') as ref:
    for line in ref:
        for i in test:
            if i in line:
                print(line)
This one, I couldn't even get a sequence from the reference file, regardless of the loop.
Can you guys spot the issue? Why won't any of these give me sequences?
Thanks in advance!
Edited to add:
Example reference:
>000000F
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
>000001F
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>000002F
TGCGTGAGGTGCTAGGGATGACAATTGAAAAGAGGACATTGATCGATCACTTGACTCATTTCAGAAAGGAGTTTGGGTTGTCCAACAAGTTGAGGGGGATGATCATCAGGCATCCTGAGT
test IDs: 000000F, 000001F
Ideal result:
000000F ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
000001F NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Current result:
000000F 000001F
If every gene ID header in your FASTA file is always followed by exactly one sequence line, this will help:
grep -A1 -Fwf geneIds.txt input.fasta
Check this example:
$ head -n 20 *
==> ids.txt <==
000000F
000001F
==> input.fasta <==
>000000F
Yes I want it!
>000001F
Yes I want it too!
>000002F
skip
>00000XYZ
skip
$ grep -A1 -Fwf ids.txt input.fasta
>000000F
Yes I want it!
--
>000001F
Yes I want it too!
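Since the question also asked about Python, here is a minimal sketch of the same idea under the same assumption (each header line is followed by exactly one sequence line). The function name extract_single_line is made up for illustration. Note that in the attempt from the question, readlines() returns a list, which has no split() method, and print(line) would only ever print the matching header, not the sequence after it.

```python
def extract_single_line(fasta_lines, wanted_ids):
    """Yield (id, sequence) pairs for headers whose ID is in wanted_ids.

    Assumption: every ">ID" header is followed by exactly one sequence line.
    """
    wanted = set(wanted_ids)
    lines = iter(fasta_lines)
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after ">".
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            if seq_id in wanted:
                # Under the one-line assumption, the next line is the sequence.
                yield seq_id, next(lines, "").strip()


# Small in-memory demo with shortened sequences.
fasta = [">000000F\n", "ctatcttcg\n", ">000001F\n", "NNNNNNNNN\n", ">000002F\n", "TGCGTGAGG\n"]
for seq_id, seq in extract_single_line(fasta, ["000000F", "000001F"]):
    print(seq_id, seq)
```

With real files you could pass an open file handle for input.fasta as fasta_lines and build the ID list with open("ids.txt").read().split(). Unlike grep -A1, this compares the whole ID exactly and does not emit the "--" separator lines.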
Depending on the size of the file, your access patterns, and what else you may use the sequences for, it may be easiest to just build a BLAST database and then feed it your identifiers. It will return exactly what you are asking for (except as correctly formatted FASTA).
Pros: it is well designed, tested, and fast.
Cons: it may be overkill for your task
(but still super useful if you will be continuing to work in this space).
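If a full BLAST database does feel like overkill, a much lighter stand-in is a byte-offset index over the FASTA file. To be clear, this is not BLAST and not what the tools above do internally; it is just a rough sketch of the "index once, look up many times" idea. The names build_index and fetch are hypothetical, and the file handle is assumed to be opened in binary mode.

```python
import io


def build_index(handle):
    """Map each header ID to the byte offset of its header line.

    Assumption: handle is a seekable file object opened in binary mode.
    """
    index = {}
    offset = handle.tell()
    for raw in iter(handle.readline, b""):
        if raw.startswith(b">"):
            parts = raw[1:].split()
            if parts:
                index[parts[0].decode()] = offset
        offset = handle.tell()
    return index


def fetch(handle, index, seq_id):
    """Return the sequence for seq_id, joining multi-line sequences."""
    handle.seek(index[seq_id])
    handle.readline()  # skip the header line itself
    chunks = []
    for raw in iter(handle.readline, b""):
        if raw.startswith(b">"):
            break  # reached the next record
        chunks.append(raw.strip())
    return b"".join(chunks).decode()


# In-memory demo; a real file would be opened with open(path, "rb").
handle = io.BytesIO(b">000000F\nctat\ncttc\n>000001F\nNNNN\n")
index = build_index(handle)
print("000001F", fetch(handle, index, "000001F"))
```

The index is built in one pass, after which each lookup seeks straight to the record instead of rescanning the whole reference.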
Given:
$ cat file
>000000F
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
>000001F
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
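With awk you can read records separated by blank lines (two or more consecutive \n) in paragraph mode, which is enabled by setting RS to an empty string. The examples below rely on a blank line separating each record in the file. This allows you to easily build an associative database of a file in that format.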
Example, search by exact string:
awk -v RS= -v FS="\n" -v q=">000000F" '$1==q{print $2}' file
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
Or search by regex:
awk -v RS= -v FS="\n" -v q="[01]F$" '$1~q {print $2}' file
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Or, build an associative array:
awk -v RS= -v FS="\n" '{arr[$1]=$2} END{ "do something with the data in arr" }' file
Which you could use to print from a file with a list of ids:
$ cat ids
>000001F
>000000F
awk -v RS= -v FS="\n" 'FNR==NR{for(i=1; i<=NF; i++) ids[$i]; next}
$1 in ids{print $2}' ids file
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
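For comparison, the same associative-array approach can be sketched in Python. This is only an illustration: read_fasta and select are made-up names, and unlike the awk version it keys on the ID without the leading ">", joins sequences that wrap over several lines, and does not need blank lines between records.

```python
def read_fasta(lines):
    """Return a dict mapping each header ID to its full sequence."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # Use the first token after ">" as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    # Join the collected chunks so wrapped sequences become one string.
    return {seq_id: "".join(chunks) for seq_id, chunks in records.items()}


def select(records, wanted_ids):
    """Return (id, sequence) pairs in the order of wanted_ids, skipping unknown IDs."""
    return [(i, records[i]) for i in wanted_ids if i in records]


# In-memory demo with shortened sequences.
fasta = [">000000F\n", "ctat\n", "cttc\n", "\n", ">000001F\n", "NNNN\n"]
for seq_id, seq in select(read_fasta(fasta), ["000001F", "000000F"]):
    print(seq_id, seq)
```

Note one difference from the awk command: results here come out in the order of the ID list, whereas awk prints them in the order they appear in the FASTA file.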