In a file, I have some characters to be substituted.
letters = ["B", "Z", "J", "U", "O"]
for record in SeqIO.parse(inFile, "fasta"):
for letter in letters:
if letters in str(record.seq):
print record.id
record.seq = str(record.seq).replace(letter, "X")
outFile.write(">%s\n%s\n" % (record.description, record.seq))
else:
outFile.write(">%s\n%s\n" % (record.description, record.seq))
#pass
The problem is that the output looks like this, writing the outputs as many characters I have in letters:
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
I think what you are trying to do is replace the ambiguous IUPAC amino acid codes (plus some additional letters that you have somehow gained?) with 'X'
.
Better to use str.translate()
(in Python 3) to do all your replacements at once. Also, since you're using Biopython to read the file, you can also write the output file easily with Biopython.
from Bio import SeqIO
from Bio.Seq import Seq
letters = ["B", "Z", "J", "U", "O"]
trans_tab = str.maketrans(''.join(letters), 'X'*len(letters))
def yield_seqs(in_file):
for record in SeqIO.parse(in_file, 'fasta'):
record.seq = Seq(str(record.seq).translate(trans_tab))
yield record
SeqIO.write(yield_seqs('input.fasta'), 'output.fasta', 'fasta')
Example:
$ cat input.fasta
>1
MBZJ
$ python3 myscript.py
$ cat output.fasta
>1
MXXX
You have a typo.
if letters in str(record.seq):
instead of
if letter in str(record.seq)
so, your check always fails, and prints the else
part for every single letter.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.