Tokenize a string, keeping the delimiters, in Python

Question

Is there any equivalent to str.split in Python that also returns the delimiters?

I need to preserve the whitespace layout of my output after processing some of the tokens.

Example:

>>> s="\tthis is an  example"
>>> print s.split()
['this', 'is', 'an', 'example']

>>> print what_I_want(s)
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

Thanks!

Answer 1

How about:

import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
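
As a quick check, here is a minimal sketch that applies this pattern to the string from the question. The wrapper name `tokenize_keep_whitespace` is made up for illustration; the pattern is the one above. Since every character is either whitespace or not, joining the tokens should give back the original string.

```python
import re

# Same pattern as above: a run of whitespace OR a run of non-whitespace.
splitter = re.compile(r'(\s+|\S+)')

def tokenize_keep_whitespace(s):
    """Hypothetical helper: return alternating whitespace / non-whitespace runs."""
    return splitter.findall(s)

s = "\tthis is an  example"
tokens = tokenize_keep_whitespace(s)
print(tokens)
# ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

# The layout is preserved: the tokens concatenate back to the input.
print(''.join(tokens) == s)
```

Unlike re.split, findall does not produce empty strings at the start or end of the result.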

Answer 2

>>> import re
>>> re.compile(r'(\s+)').split("\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
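
Note the empty string at the front of that result: it appears because the input starts with a delimiter. If that is unwanted, one option is to filter it out afterwards. The following is only a sketch, and `split_keep_whitespace` is a hypothetical name, not something from the re module.

```python
import re

def split_keep_whitespace(s):
    """Hypothetical helper: re.split with a capturing group, minus empty strings."""
    parts = re.split(r'(\s+)', s)
    # re.split emits '' when the string starts or ends with a delimiter.
    return [p for p in parts if p]

print(split_keep_whitespace("\tthis is an  example"))
# ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
```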

Answer 3

The re module provides this functionality:

>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

(quoted from the Python documentation).

For your example (split on whitespace), use re.split(r'(\s+)', '\tThis is an example').

The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.

Edit: As pointed out, any leading/trailing delimiters will of course also be added to the list. To avoid that, you can call the .strip() method on your input string first.
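
Since the question is about processing tokens while keeping the whitespace layout, here is a rough sketch of how the split result might be used for that. It relies on the fact that, with a single capturing group, re.split puts the text between delimiters at even indices and the captured delimiters at odd indices. The helper name `transform_words` is an assumption for illustration only.

```python
import re

def transform_words(s, func):
    """Hypothetical helper: apply func to words, leave whitespace untouched."""
    parts = re.split(r'(\s+)', s)
    # With one capturing group, even indices hold the text between
    # delimiters and odd indices hold the captured delimiters.
    parts[::2] = [func(word) for word in parts[::2]]
    return ''.join(parts)

print(repr(transform_words("\tthis is an  example", str.upper)))
# '\tTHIS IS AN  EXAMPLE'
```

Here the empty strings at the ends are harmless, so no .strip() is needed and the leading tab survives.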

Answer 4

Have you looked at pyparsing? Example borrowed from the pyparsing wiki:

>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
... 
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})

Answer 5

Thanks, guys, for pointing me to the re module. I'm still trying to decide between that and using my own function that returns a sequence:

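def split_keep_delimiters(s, delims="\t\n\r "):
    """Yield alternating runs of delimiter and non-delimiter characters."""
    if not s:
        return  # nothing to yield; s[0] below would raise IndexError
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        # A change between delimiter and non-delimiter ends the current run.
        if delim_group != (char in delims):
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]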

If I had time I'd benchmark them xD
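
For anyone who does want to benchmark, a rough sketch with the standard timeit module could look like this. The generator is a lightly adapted copy of the function above (with a guard for empty input), and `split_with_re`, the sample string, and the run count are arbitrary choices made for illustration. Timings will vary by machine and Python version, so no numbers are claimed here.

```python
import re
import timeit

def split_keep_delimiters(s, delims="\t\n\r "):
    """Generator version from above, with a guard for empty input."""
    if not s:
        return
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]

def split_with_re(s):
    """Regex version; the character class mirrors the default delims above."""
    return re.findall(r'[\t\n\r ]+|[^\t\n\r ]+', s)

sample = "\tthis is an  example" * 100  # arbitrary test input

# Both approaches should produce the same tokens before we time them.
assert list(split_keep_delimiters(sample)) == split_with_re(sample)

for name, func in [("generator", lambda: list(split_keep_delimiters(sample))),
                   ("re.findall", lambda: split_with_re(sample))]:
    seconds = timeit.timeit(func, number=1000)
    print("%-12s %.4f s for 1000 runs" % (name, seconds))
```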

5 solutions

Solution 1: score 19, accepted, 2009-11-30 15:08:11

Solution 2: score 6, 2009-11-30 15:08:56

Solution 3: score 4, 2009-11-30 15:09:01

Solution 4: score 3, 2009-11-30 15:39:03

Solution 5: score -1, 2009-11-30 15:28:21
