Changing the names of FASTA sequences in a file to include each sequence's nucleotide count

I don't know much programming, but I am learning Linux and Python. I have a sequence file containing 13,500 sequences, and the sequence names all have this form:

>MP_scaffold_001_1

I want to count the number of nucleotides in each sequence and change its name to

>MP_scaffold_001_1 <TAB> <Number_of_nucleotides>

If you're working with biological sequences in Python, you can't go wrong with Biopython. Its SeqIO module contains tools for working with sequences, including those in FASTA format. The following code should get you started:

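from Bio import SeqIO

# Both files are closed automatically when the with block ends.
with open("input.fasta", "r") as input, open("output.fasta", "w") as output:
    for seq in SeqIO.parse(input, "fasta"):
        # Build the suffix: a tab followed by the sequence length.
        length = "\t%d" % len(seq)
        # Append it to the header text, then write the record out.
        seq.description += length
        SeqIO.write(seq, output, "fasta")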

This code first opens two file handles, input and output, which are closed automatically when processing is complete. Next, SeqIO.parse() iterates over each sequence record (seq) in input. The length of the sequence is found with Python's built-in len() function, and a formatted string is built from the tab character \t and the number returned by len(). The description string of each seq is then extended by appending the contents of the length variable. Finally, the modified record is written to the output file in FASTA format.
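
If Biopython is not installed, the same idea can be sketched with the standard library alone. This is a swapped-in alternative, not Biopython's method: the function names append_lengths and write_record are hypothetical, the parser assumes a plain FASTA file where every record starts with a ">" line, and each sequence is written back on a single line rather than wrapped.

```python
import io


def write_record(out_handle, header, chunks):
    """Write one record with a tab and the sequence length added to the header."""
    sequence = "".join(chunks)
    out_handle.write("%s\t%d\n" % (header, len(sequence)))
    out_handle.write(sequence + "\n")


def append_lengths(in_handle, out_handle):
    """Copy FASTA records from in_handle to out_handle, adding lengths to headers."""
    header, chunks = None, []
    for line in in_handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header begins: flush the previous record first.
            if header is not None:
                write_record(out_handle, header, chunks)
            header, chunks = line, []
        elif header is not None:
            # Sequence lines may span several lines; collect them all.
            chunks.append(line)
    # Flush the final record, which has no following header to trigger it.
    if header is not None:
        write_record(out_handle, header, chunks)


# Small in-memory demonstration with made-up sequences.
demo_in = io.StringIO(">MP_scaffold_001_1\nACGT\nAC\n>MP_scaffold_001_2\nGGG\n")
demo_out = io.StringIO()
append_lengths(demo_in, demo_out)
result = demo_out.getvalue()
```

For real files, the two handles would come from open("input.fasta") and open("output.fasta", "w"), as in the Biopython version above.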

I'd highly recommend reading through Biopython's Tutorial and Cookbook to familiarize yourself with all that the module provides.
