Hello I have a fasta file such as:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence2 [virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence3
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence5 hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence7 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
And in this file I would like to remove duplicated sequence and get:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
Here as you can see the containt after the > name
for sequence1_CP
, sequence2
and sequence3
is the same, then I want only to keep on of the 3. But if one of the 3 sequences have a _CP
in its name, then I want to keep this one especially. If there is none _CP
in any of them it does not mater wich one I keep.
Sequence1_CP
, Sequence2
and Sequence3
I keep sequence1_CP
sequence4_CP
and sequence5
I keep sequence4_CP
sequence7
I keep the first one sequence6
Does someone have an idea using biopython or a bash method? Thanks a lot
You could use this awk one-liner:
$ awk 'BEGIN{FS="\n";RS=""}{if(!seen[$2,$3]++)print}' file
Output:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
>sequence6 |hypothetical protein[virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD
ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
Above relies on observation that the sequences are in order where the _CP
s come before others like in the sample. If this is not in fact the case, use the following. It stores the first instance of each sequence which is overwritten if a _CP
sequence is found:
$ awk 'BEGIN{FS="\n";RS=""}{if(!($2,$3) in seen||$1~/^[^ ]+_CP /)seen[$2,$3]=$0}END{for(i in seen)print (++j>1?ORS:"") seen[i]}' file
Or in pretty-print:
$ awk '
BEGIN {
FS="\n"
RS=""
}
{
if(!($2,$3) in seen||$1~/^[^ ]+_CP /)
seen[$2,$3]=$0
}
END {
for(i in seen)
print (++j>1?ORS:"") seen[i]
}' file
The output order is awk default ie. appears random.
Update If @kvantour's BOTH comments are valid in this case, use this awk:
$ awk '
BEGIN {
FS="\n"
RS=""
}
{
for(i=2;i<=NF;i++)
k=(i==2?"":k) $i
if(!(k in seen)||$1~/^[^ ]+_CP /)
seen[k]=$0
}
END {
for(i in seen)
print (++j>1?ORS:"") seen[i]
}' file
Output now:
>sequence1_CP [seq virus]
MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL
DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE
>sequence4_CP hypothetical protein [another virus]
MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN
IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK
In a fasta file, identical sequences are not necessarily split at the same position. So it is paramount to merge the sequences before comparing. Furthermore, sequences can have upper case or lower case, but are in the end case insensitive:
The following awk will do exactly that:
$ awk 'BEGIN{RS="";ORS="\n\n"; FS="\n"}
{seq="";for(i=2;i<=NF;++i) seq=seq toupper($i)}
!(seq in a){print; a[seq]}' file.fasta
There exists actually a version of awk which has been upgraded to process fasta files:
$ bioawk -c fastx '!(seq in a){print; a[seq]}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX .
Or pure-bash solution (following same log as separate perl
solution):
#! /bin/bash
declare -A p
# Read inbound data into associative array 'p'
while read id ; do
read s1 ; read s2 ; read s3
key="$s1:$s2"
prev=${p[$key]}
if [[ -z "$prev" || "$id" = %+CP% ]] ; then p[$key]=$id ; fi
done
# Print all data
for k in "${!p[@]}" ; do
echo -e "${p[$k]}\n${k/:/\\n}\n"
done
Here's a python program that will provide you with results you are looking for:
import fileinput
import re
seq=""
nameseq={}
seqnames={}
for line in fileinput.input():
line = line.rstrip()
if re.search( "^>", line ):
if seq:
nameseq[ id ] = seq
if seq in seqnames:
if re.search( "_CP", id ):
seqnames[ seq ] = id
else:
seqnames[ seq ] = id
seq = ""
id = line
continue
seq += line
for k,v in seqnames.iteritems():
print(v)
print(k)
Or with perl
. Assuming code in m.pl,can be wrapped into bash script
Hopefully, code will help find medicines, and not develop new viruses:-)
perl m.pl < input-file
! /usr/bin/perl
use strict ;
my %to_id ;
local $/ = "\n\n";
while ( <> ) {
chomp ;
my ($id, $s1, $s2 ) = split("\n") ;
my $key = "$s1\n$s2" ;
my $prev_id = $to_id{$key} ;
$to_id{$key} = $id if !defined($prev_id) || $id =~ /_CP/ ;
} ;
print "$to_id{$_}\n$_\n\n" foreach(keys(%to_id)) ;
It's not clear what is the expected order. Perl code will print directly from hash. Can be customized, if needed.
Here's a Biopython answer. Be aware that you only have two unique sequences in your example (sequence 6 and 7 only show a character more in the first line but are essentially the same protein sequence as 1).
from Bio import SeqIO
seen = []
records = []
# examples are in sequences.fasta
for record in SeqIO.parse("sequences.fasta", "fasta"):
if str(record.seq) not in seen:
seen.append(str(record.seq))
records.append(record)
# printing to console
for record in records:
print(record.name)
print(record.seq)
# writing to a fasta file
SeqIO.write(records, "unique_sequences.fasta", "fasta")
For more info you can try the biopython cookbook
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.