I would like to extract certain header lines together with the sequence data that follows them.
There is a file, ecoli.ffn, as follows:
$ head ecoli.ffn
>ecoli16:g027092:GCF_000460315:gi|545267691|ref|NZ_KE701669.1|:551259-572036
ATGAGCCTGATTATTGATGTTATTTCGCGT
AAAACATCCGTCAAACAAACGCTGATTAAT
>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
>ecoli16:g000012:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
CTGACAGCTGTTCTTACACTGGATTCAACC
CTGACAGCTGTTCTTACACTGGATTCAACC
and an index.txt as follows:
$ head index.txt
g000011
g000012
What I want to do is extract from ecoli.ffn the records whose IDs are listed in index.txt. The ideal output is:
>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
>ecoli16:g000012:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
CTGACAGCTGTTCTTACACTGGATTCAACC
CTGACAGCTGTTCTTACACTGGATTCAACC
How can I do this?
awk to the rescue!
$ awk -F: -v RS=">" 'NR==FNR{n=split($0,t,"\n");
                       for(i=1;i<n;i++) a[t[i]];
                       next}
                     $2 in a{printf "%s", RS $0}' index.txt ecoli.ffn
>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGATCTGACAGCTGTTCTTACACTGGATTCAACC
>ecoli16:g000012:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGATCTGACAGCTGTTCTTACACTGGATTCAACC
UPDATE: Note that this does not depend on how many lines each record has. For the updated input file, the same script gives this output:
$ awk -F: -v RS=">" 'NR==FNR{n=split($0,t,"\n");
                       for(i=1;i<n;i++) a[t[i]];
                       next}
                     $2 in a{printf "%s", RS $0}' index.txt ecoli.ffn
>ecoli16:g000011:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
>ecoli16:g000012:55989:gi|218693476|ref|NC_011748.1|:1128430-1131042
GTGTACGCTATGGCGGGTAATTTTGCCGAT
CTGACAGCTGTTCTTACACTGGATTCAACC
CTGACAGCTGTTCTTACACTGGATTCAACC
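For readers who find the RS=">" trick hard to follow, here is a minimal line-oriented sketch of the same idea. It assumes, as in the question, that the ID is the second ':'-separated field of each header line and that index.txt holds one ID per line. The function name extract_by_id is made up for this illustration and is not part of the answer above.

```shell
#!/bin/sh
# extract_by_id IDS FASTA
# Print every FASTA record whose second ':'-separated header field is listed in IDS.
# (extract_by_id is a hypothetical helper name used only for this sketch.)
extract_by_id() {
    awk -F: '
        NR == FNR { if ($0 != "") ids[$0]; next }  # first file: remember each non-empty ID
        /^>/      { keep = ($2 in ids) }           # header line: decide whether to keep this record
        keep                                       # print header and sequence lines of kept records
    ' "$1" "$2"
}

# Example call (assumes both files are in the current directory):
#   extract_by_id index.txt ecoli.ffn > subset.ffn
```

Because the keep flag only changes on header lines, a record is printed whole regardless of how many sequence lines it spans.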
Write a simple script, ecoli.sh, using awk:
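#!/bin/bash
# For each ID in index.txt, print the matching record (header plus all of its sequence lines).
while read -r i
do
    # A header line sets flag to 1 if its second ':'-separated field equals the ID, else 0;
    # every line is printed while flag is set, so multi-line sequences are kept whole.
    awk -F: -v i="$i" '/^>/{flag = ($2 == i)} flag' ecoli.ffn
done < index.txt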
Then run this script from your shell, for example with: bash ecoli.sh
The following script can filter a FASTA file by IDs given as a list or in a file, which seems to be what you are asking for here:
https://github.com/jorvis/biocode/blob/master/fasta/filter_fasta_by_ids.pl
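Whichever approach is used, it may be worth checking that every ID in index.txt was actually found in ecoli.ffn. The sketch below lists the IDs that have no matching header. It makes the same assumptions as the question (ID in the second ':'-separated header field, one ID per line in index.txt), and the function name missing_ids is made up for this illustration.

```shell
#!/bin/sh
# missing_ids IDS FASTA
# Print each ID from IDS that does not appear as the second ':'-separated
# field of any header line in FASTA.
# (missing_ids is a hypothetical helper name used only for this sketch.)
missing_ids() {
    awk -F: '
        NR == FNR { if ($0 != "") want[$0]; next }  # first file: IDs we expect to find
        /^>/      { seen[$2] }                      # second file: record every header ID
        END       { for (id in want) if (!(id in seen)) print id }
    ' "$1" "$2" | sort
}

# Example call (assumes both files are in the current directory):
#   missing_ids index.txt ecoli.ffn
```

An empty result means every requested ID has a record in the FASTA file.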