I don't know much programming, but I am learning Linux and Python. I have a sequence file containing 13500 sequences, and the sequence names all have the form
>MP_scaffold_001_1
I want to count the number of nucleotides in each sequence and change its name to
>MP_scaffold_001_1 <TAB> <Number_of_nucleotides>
If you're working with biological sequences in Python, you can't go wrong with Biopython. The SeqIO module contains tools for working with sequences, including those in FASTA format. The following code should get you started:
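    from Bio import SeqIO

    # Open both files; they are closed automatically when the block ends.
    with open("input.fasta", "r") as input, open("output.fasta", "w") as output:
        for seq in SeqIO.parse(input, "fasta"):
            # Build a tab followed by the sequence length, then append it to the header.
            length = "\t%d" % len(seq)
            seq.description += length
            SeqIO.write(seq, output, "fasta")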
This code first opens two file handles, input and output, that will automatically be closed when the processing is complete. Next, each sequence (seq) in input is iterated through using the SeqIO.parse() function. The length of the sequence is determined using Python's built-in len() function, and a formatting string is built from the tab character \t and the number returned by len(). Then the description string of each seq is modified by adding the contents of the length variable onto the end of it. For a record parsed from FASTA, description holds the whole header line without the leading >, so this appends the tab and the length to the header. Finally, the newly modified record is written to the output file in FASTA format.
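If installing Biopython is not an option, the same idea can be sketched in plain Python. This is only a minimal illustration, not a replacement for SeqIO: the function names append_lengths and _format_record are made up for this example, and it assumes a simple FASTA layout where every record starts with a > line and blank lines can be ignored.

```python
def _format_record(header, chunks):
    """Return the header with a tab and the total length, followed by the sequence lines."""
    length = sum(len(chunk) for chunk in chunks)
    return [f"{header}\t{length}"] + chunks


def append_lengths(lines):
    """Yield FASTA lines with the nucleotide count appended to each header.

    Hypothetical helper: assumes each record begins with a '>' line.
    Sequence lines that appear before the first header are dropped.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield from _format_record(header, chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield from _format_record(header, chunks)


example = [">MP_scaffold_001_1", "ACGTAC", "GGT", ">MP_scaffold_001_2", "TTGA"]
result = list(append_lengths(example))
```

To use it on files, pass the open input file as lines and write each yielded line plus a newline to the output file, for example with the same input.fasta and output.fasta names as above. Because the whole record is buffered until the next header is seen, memory use is bounded by the longest single sequence rather than by the file.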
I'd highly recommend reading through Biopython's Tutorial and Cookbook to familiarize yourself with all that the module provides.