How to specify number of characters in each line in python?

I have a FASTA file that contains two gene sequences. I want to remove the FASTA headers (lines starting with ">"), concatenate the remaining lines, and write the resulting sequence out at 50 characters per line. I have made some progress but got stuck at the end.

Here is my FASTA file:

>Potrs164783
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGATCAGAATTGAACCAACATGATGAAGGGGAT
TGTTTGCCATCAGAATATGGCATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCCTTAGTGAGAA
CTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATAAGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAA
>Potrs164784
TTACCCTCTACCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCCAAAACCATTGTTACCATTCC
GGAATTACATTCTGAGATAAAAACCCTCAAATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC

And the output that I want is something like this:

>conc
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGA
TCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGG
CATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCC
TTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATA
AGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAAACCATATGGC
ATTTTGCATCCATTTGTGCATTTCATTTAGTTTACTTGCATTCATTCAGG

My script so far is

final = list()

with open("test.fa", 'r') as fh_in:
    for line in fh_in:
        line = line.strip()
        if not line.startswith(">"):
            final.append(line)

final2 = "".join(final)

with open("testconcat.fa", 'w') as fh_out:
    fh_out.write(">conc")
    fh_out.write("\n")
    fh_out.write(final2)

How can I make sure that I write only 50 characters on each line?

You can use the built-in textwrap module:

import textwrap
final2 = "".join(final)
# wrap() returns a list of pieces, each at most 50 characters long
print('\n'.join(textwrap.wrap(final2, 50)))
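Because a nucleotide sequence contains no whitespace, plain slicing gives the same fixed-width rows as textwrap here. The following is only a minimal sketch: the helper names `wrap_fixed` and `concat_fasta` are made up for illustration, and the `>conc` header is the one from the desired output in the question.

```python
def wrap_fixed(seq, width=50):
    """Split seq into consecutive pieces of at most `width` characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def concat_fasta(lines, header=">conc", width=50):
    """Drop FASTA header lines, join the rest, and return the wrapped text."""
    # Skip every line starting with ">" and glue the remaining lines together.
    seq = "".join(line.strip() for line in lines if not line.startswith(">"))
    # One header line, then the sequence in rows of `width` characters.
    return "\n".join([header] + wrap_fixed(seq, width)) + "\n"
```

With the files from the question, this would be called as `fh_out.write(concat_fasta(fh_in))` with both files open. It still builds the whole sequence in memory, so it suits small to medium files.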

With large files, joining and slicing everything in memory can become a problem: you first hold every line in a list, then build one large string from it, and then split that string into equal chunks again before anything is written to the file.

I think the best way to avoid this is to work on the file rather than in memory. In other words, write as you read.

width = 50
with open('test.fa') as r, open('testconcat.fa', 'w') as w:
    w.write('>conc\n')
    buf = ''
    for line in r:
        if line.startswith('>'):
            continue  # skip FASTA header lines
        buf += line.strip()
        # write out every complete 50-character row collected so far
        while len(buf) >= width:
            w.write(buf[:width] + '\n')
            buf = buf[width:]
    if buf:
        w.write(buf + '\n')  # last, shorter row (if any)

$ cat testconcat.fa
>conc
AGGAAGTGTGAGATTGAAAAAACATTACTATTGAGGAATTTTTGACCAGA
TCAGAATTGAACCAACATGATGAAGGGGATTGTTTGCCATCAGAATATGG
CATGAAATTTCTCCCCTAGATCGGTTCAAGCTCCTGTAGGTTTGGAGTCC
TTAGTGAGAACTTTCTTAAGAGAATCTAATCTGGTCTGTTCCTCGTCATA
AGTTAAAGAAAAACTTGAAACAAATAACAAGCATGCATAATTACCCTCTA
CCAGCACCAATGCCTATGATCTTACAAAAATCCTTAATAAAAAGAAATCC
AAAACCATTGTTACCATTCCGGAATTACATTCTGAGATAAAAACCCTCAA
ATCTGAATTACAATCCCTTAAACAAGCCCAACAAAAAGACTCTGCCATAC
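Whichever approach you use, a quick sanity check is that removing the newlines from the wrapped file gives back exactly the concatenated input, and that every row except possibly the last has the full width. A rough sketch, assuming header lines (starting with ">") should be ignored on both sides; `check_wrapped` is a hypothetical helper name:

```python
def check_wrapped(in_lines, out_lines, width=50):
    """Return True if out_lines is a lossless fixed-width rewrap of in_lines."""
    # Sequence we expect: all non-header input lines joined together.
    expected = "".join(l.strip() for l in in_lines if not l.startswith(">"))
    # Rows actually written, without their trailing newline.
    rows = [l.rstrip("\n") for l in out_lines if not l.startswith(">")]
    # Every row but the last must be exactly `width` characters long.
    full_rows_ok = all(len(r) == width for r in rows[:-1])
    # The last row may be shorter, but not empty and not longer than `width`.
    last_row_ok = not rows or 0 < len(rows[-1]) <= width
    return full_rows_ok and last_row_ok and "".join(rows) == expected
```

Both arguments can be open file objects, for example `check_wrapped(open("test.fa"), open("testconcat.fa"))`. A result of `False` means bases were lost, added, or wrapped at the wrong width.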

Hope this helps.
