简体   繁体   English

如何将一个字符串分成相等大小的部分?

[英]How to split a string into equal sized parts?

I have a string that contains a sequence of nucleotides.我有一个包含核苷酸序列的字符串。 The string is 1191 nucleotides long.该字符串的长度为 1191 个核苷酸。

How do I print the sequence in a format which each line only has 100 nucleotides?如何以每行只有 100 个核苷酸的格式打印序列? right now I have it hard coded but I would like it to work for any string of nucleotides.现在我对它进行了硬编码,但我希望它适用于任何核苷酸串。 here is the code I have now这是我现在的代码

def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    #how do I make sure to only have 100 nucleotides per line?
    print(Sequence[0:100])
    print(Sequence[100:200])
    print(Sequence[200:300])
    print(Sequence[400:500])
    print(Sequence[500:600])
    print(Sequence[600:700])
    print(Sequence[700:800])
    print(Sequence[800:900])
    print(Sequence[900:1000])
    print(Sequence[1000:1100])
    print(Sequence[1100:1191])
printinfasta(SeqName, Sequence, SeqDescription)
Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"

You can use textwrap.wrap to split long strings into list of strings您可以使用textwrap.wrap将长字符串拆分为字符串列表

import textwrap

seq = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
print('\n'.join(textwrap.wrap(seq, width=100)))

You can use itertools.zip_longest and some iter magic to get this in one line:您可以使用itertools.zip_longest和一些iter魔法在一行中得到它:

from itertools import zip_longest

sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT" 

output = [''.join(filter(None, s)) for s in zip_longest(*([iter(sequence)]*100))]

Or:或者:

for s in zip_longest(*([iter(sequence)]*100)):
    print(''.join(filter(None, s)))

A possible solution is to use re module.一个可能的解决方案是使用re模块。

import re

def splitstring(strg, leng):
    chunks = re.findall('.{1,%d}' % leng, strg)
    for i in chunks:
        print(i)


splitstring(strg = seq, leng = 100))

I assume that your sequence is in FASTA format.我假设您的序列是 FASTA 格式。 If this is the case, you can use any of a number of bioinformatics packages that provide FASTA sequence wrapping utilities.如果是这种情况,您可以使用许多提供 FASTA 序列包装实用程序的生物信息学软件包中的任何一个。 For example, you can use FASTX-Toolkit .例如,您可以使用FASTX-Toolkit Wrap FASTA sequences using FASTA Formatter command line utility, for example to a max of 100 nucleotides per line:使用FASTA Formatter命令行实用程序包装 FASTA 序列,例如每行最多 100 个核苷酸:

fasta_formatter -i INFILE -o OUTFILE -w 100

You can install FASTX-Toolkit package using conda , for example:您可以安装FASTX-Toolkit使用包conda ,例如:
conda install fastx_toolkit
or或者
conda create -n fastx_toolkit fastx_toolkit

Note that if you end up writing the (simple) code to wrap FASTA sequences from scratch, remember that the header lines (the lines starting with > ) should not be wrapped.请注意,如果结束了写入(简单)代码从头包裹FASTA序列,请记住,标题行(行开始>应该被缠绕。 Wrap only the sequence lines.仅包装序列行。

SEE ALSO:也可以看看:

Convert single line fasta to multi line fasta将单行 fasta 转换为多行 fasta

You can use a helper function based on itertools.zip_longest .您可以使用基于itertools.zip_longest的辅助函数。 The helper function has been designed to (also) handle cases where the sequence isn't an exact multiple of the size of the equal parts (the last group will have fewer elements than those before it). helper 函数被设计为(也)处理序列不是相等部分大小的精确倍数的情况(最后一组将比之前的元素少)。

from itertools import zip_longest


def grouper(n, iterable):
    """ s -> (s0,s1,...sn-1), (sn,sn+1,...s2n-1), (s2n,s2n+1,...s3n-1), ... """
    FILLER = object()  # Value that couldn't be in data.
    for result in zip_longest(*[iter(iterable)]*n, fillvalue=FILLER):
        yield ''.join(v for v in result if v is not FILLER)


def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    for group in grouper(100, Sequence):
        print(group)

Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"

printinfasta('Name', Sequence, 'Description')

Sample output:示例输出:

Name Description
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
CCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTA
AATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCC
TAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTT
TGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACAT
TTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT

Package cytoolz (installable using pip install cytoolz ) provides a function partition_all that can be used here:cytoolz (可使用pip install cytoolz )提供了一个可以在此处使用的函数partition_all

#!/usr/bin/env python3
from cytoolz import partition_all

def printinfasta(name, seq, descr):
    header = f">{name} {descr}"
    print(header)
    print(*map("".join, partition_all(100, seq)), sep="\n")


printinfasta("test", 468 * "ACGTGA", "this is a test")

partition_all(100, seq)) generate tuples of 100 letters each taken from seq , and a last shorter one is the number of letters is not a multiple of 100. partition_all(100, seq))生成每个取自seq的 100 个字母的元组,最后一个较短的是字母数不是 100 的倍数。

The map("".join, ...) is used to group letters within each such tuple into a single string. map("".join, ...)用于将每个此类元组中的字母分组为单个字符串。

The * in front of the map makes its results considered as separate arguments to print . map前面的*使其结果被视为print单独参数。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM