How to split a string into equal sized parts?
I have a string that contains a sequence of nucleotides. The string is 1191 nucleotides long.
How do I print the sequence in a format in which each line has only 100 nucleotides? Right now I have it hard-coded, but I would like it to work for any string of nucleotides. Here is the code I have now:
def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    #how do I make sure to only have 100 nucleotides per line?
    print(Sequence[0:100])
    print(Sequence[100:200])
    print(Sequence[200:300])
    print(Sequence[300:400])
    print(Sequence[400:500])
    print(Sequence[500:600])
    print(Sequence[600:700])
    print(Sequence[700:800])
    print(Sequence[800:900])
    print(Sequence[900:1000])
    print(Sequence[1000:1100])
    print(Sequence[1100:1191])
printinfasta(SeqName, Sequence, SeqDescription)
Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
You can use textwrap.wrap to split a long string into a list of strings:
import textwrap
seq = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
print('\n'.join(textwrap.wrap(seq, width=100)))
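As a minimal sketch, the same call can be folded back into the function from the question so that it works for a sequence of any length. The format_fasta helper and its width parameter are hypothetical names added for illustration; they are not part of the original code.

```python
import textwrap

def format_fasta(seq_name, sequence, seq_description, width=100):
    """Return the header line plus the sequence wrapped to `width` per line.

    format_fasta and width are hypothetical names introduced for this sketch.
    """
    header = seq_name + " " + seq_description
    # textwrap.wrap breaks words longer than `width` by default, so a
    # sequence without whitespace is cut into chunks of at most `width`.
    lines = textwrap.wrap(sequence, width=width)
    return "\n".join([header] + lines)

# A 240-character sequence gives lines of 100, 100 and 40 characters.
print(format_fasta("Name", "ACGT" * 60, "Description", width=100))
```

Returning the text instead of printing it inside the function keeps the wrapping easy to test and lets the caller decide where the output goes.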
You can use itertools.zip_longest and some iter magic to get this in one line:
from itertools import zip_longest
sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
output = [''.join(filter(None, s)) for s in zip_longest(*([iter(sequence)]*100))]
Or:
for s in zip_longest(*([iter(sequence)]*100)):
    print(''.join(filter(None, s)))
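To see why the iter trick groups consecutive characters, it may help to run it on a short string. The sketch below uses a made-up 8-letter sequence and a group size of 3 purely for illustration.

```python
from itertools import zip_longest

sequence = "ABCDEFGH"  # illustrative input, not the sequence from the question
n = 3

# [iter(sequence)] * n is a list holding the SAME iterator n times, so
# zip_longest consumes n consecutive characters for every tuple it builds.
groups = list(zip_longest(*[iter(sequence)] * n))
# The last tuple is padded with None, the default fillvalue.

# filter(None, ...) drops that padding; single characters are never falsy.
chunks = ["".join(filter(None, g)) for g in groups]

print(groups)
print(chunks)
```

Note that filter(None, ...) would also drop any other falsy items, which is harmless for characters but matters if the same pattern is reused on data such as numbers that may contain 0.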
A possible solution is to use the re module.
import re
def splitstring(strg, leng):
    # '.{1,N}' greedily matches up to N characters at a time
    chunks = re.findall('.{1,%d}' % leng, strg)
    for i in chunks:
        print(i)
splitstring(strg=seq, leng=100)
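A variant that returns the chunks instead of printing them can be easier to reuse. This is only a sketch: the split_chunks name is hypothetical, and passing re.DOTALL is an added assumption so that '.' also matches newline characters, which the default pattern above skips.

```python
import re

def split_chunks(text, size):
    """Return `text` cut into pieces of at most `size` characters.

    split_chunks is a hypothetical helper introduced for this sketch.
    """
    # The quantifier {1,size} is greedy, so every match except possibly
    # the last one is exactly `size` characters long.
    # re.DOTALL (an assumption, not in the original) lets '.' match '\n'.
    pattern = re.compile(".{1,%d}" % size, re.DOTALL)
    return pattern.findall(text)

for chunk in split_chunks("ABCDEFGHIJ", 4):
    print(chunk)
```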
I assume that your sequence is in FASTA format. If this is the case, you can use any of a number of bioinformatics packages that provide FASTA sequence wrapping utilities. For example, you can use FASTX-Toolkit. Wrap FASTA sequences using the FASTA Formatter command line utility, for example to a maximum of 100 nucleotides per line:
fasta_formatter -i INFILE -o OUTFILE -w 100
You can install the FASTX-Toolkit package using conda, for example:
conda install fastx_toolkit
or
conda create -n fastx_toolkit fastx_toolkit
Note that if you end up writing the (simple) code to wrap FASTA sequences from scratch, remember that the header lines (the lines starting with >) should not be wrapped. Wrap only the sequence lines.
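A from-scratch wrapper that respects this rule might look like the sketch below. The wrap_fasta function, its width parameter, and the two-record example text are all hypothetical; the sketch assumes the whole file fits in memory and that a record's sequence lines should be joined before being re-wrapped.

```python
def wrap_fasta(text, width=100):
    """Re-wrap FASTA text so sequence lines have at most `width` characters.

    wrap_fasta is a hypothetical helper written for this sketch.
    """
    out = []
    seq_parts = []  # sequence lines collected for the current record

    def flush():
        # Join the collected lines and cut them into `width`-sized slices.
        seq = "".join(seq_parts)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
        seq_parts.clear()

    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()           # finish the previous record, if any
            out.append(line)  # header lines are copied unchanged
        elif line:
            seq_parts.append(line)
    flush()
    return "\n".join(out)

example = ">seq1 first record\nACGTACGTAC\n>seq2 second record\nGGGCCC\nTTT"
print(wrap_fasta(example, width=4))
```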
SEE ALSO:
Convert single line fasta to multi line fasta
You can use a helper function based on itertools.zip_longest. The helper function has been designed to (also) handle cases where the length of the sequence isn't an exact multiple of the group size (the last group will have fewer elements than those before it).
from itertools import zip_longest
def grouper(n, iterable):
    """ s -> (s0,s1,...sn-1), (sn,sn+1,...s2n-1), (s2n,s2n+1,...s3n-1), ... """
    FILLER = object()  # Value that couldn't be in data.
    for result in zip_longest(*[iter(iterable)]*n, fillvalue=FILLER):
        yield ''.join(v for v in result if v is not FILLER)

def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    for group in grouper(100, Sequence):
        print(group)
Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
printinfasta('Name', Sequence, 'Description')
Sample output:
Name Description
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
CCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTA
AATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCC
TAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTT
TGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACAT
TTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT
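Because the input here is a string rather than an arbitrary iterable, plain slicing should produce the same groups. The sketch below is an alternative to the helper above, not a restatement of it, and chunk_string is a hypothetical name.

```python
def chunk_string(sequence, size):
    """Yield consecutive slices of `sequence`, each at most `size` long.

    chunk_string is a hypothetical helper introduced for this sketch.
    """
    for start in range(0, len(sequence), size):
        # Slicing past the end is safe: the last slice is simply shorter.
        yield sequence[start:start + size]

for part in chunk_string("ABCDEFGH", 3):
    print(part)
```

Slicing needs the sequence to support len and indexing, so it does not work on a generic iterator, which is the case the zip_longest helper covers.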
The cytoolz package (installable using pip install cytoolz) provides a function partition_all that can be used here:
#!/usr/bin/env python3
from cytoolz import partition_all
def printinfasta(name, seq, descr):
    header = f">{name} {descr}"
    print(header)
    print(*map("".join, partition_all(100, seq)), sep="\n")
printinfasta("test", 468 * "ACGTGA", "this is a test")
partition_all(100, seq) generates tuples of 100 letters each, taken from seq, plus a last, shorter one if the number of letters is not a multiple of 100.
The map("".join, ...) is used to join the letters within each such tuple into a single string.
The * in front of the map unpacks its results so that they are passed to print as separate arguments.
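The three steps can be traced on a short input without installing anything. In the sketch below, partition_all_sketch is a hypothetical standard-library stand-in that is assumed to mimic the behavior described above; it is not the cytoolz implementation.

```python
from itertools import islice

def partition_all_sketch(n, iterable):
    """Yield tuples of n items; the last tuple may be shorter.

    Hypothetical stand-in for cytoolz.partition_all, for illustration only.
    """
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, n))
        if not chunk:
            return
        yield chunk

# Step 1: tuples of letters, the last one shorter.
parts = list(partition_all_sketch(4, "ACGTGACGTA"))
# Step 2: join the letters of each tuple into one string.
lines = list(map("".join, parts))
# Step 3: unpack so that each string is a separate argument to print.
print(*lines, sep="\n")
```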