Tokenize a string keeping delimiters in Python

Is there any equivalent to str.split in Python that also returns the delimiters?

I need to preserve the whitespace layout for my output after processing some of the tokens.

Example:

>>> s = "\tthis is an  example"
>>> print(s.split())
['this', 'is', 'an', 'example']

>>> print(what_I_want(s))
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

Thanks!

How about:

import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)

>>> re.compile(r'(\s+)').split("\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
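
To make the difference between the two suggestions above concrete, here is a small Python 3 sketch. The helper names tokenize_findall and tokenize_split are made up for illustration; both use only the standard re module.

```python
import re

def tokenize_findall(s):
    # Match alternating runs of whitespace and non-whitespace characters.
    return re.findall(r'\s+|\S+', s)

def tokenize_split(s):
    # Split on whitespace; the capturing group keeps the delimiters.
    return re.split(r'(\s+)', s)

s = "\tthis is an  example"
print(tokenize_findall(s))  # ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
print(tokenize_split(s))    # ['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

# Either token list joins back into the original string.
print(''.join(tokenize_findall(s)) == s)  # True
```

The findall version matches the output requested in the question, while split adds an empty string in front because the input starts with a delimiter.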

The re module provides this functionality:

>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

(quoted from the Python documentation).

For your example (split on whitespace), use re.split(r'(\s+)', '\tThis is an example').

The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.

Edit: As pointed out, any preceding/trailing delimiters will of course also be added to the list, along with an empty string before a leading delimiter or after a trailing one. To avoid that, you can use the .strip() method on your input string first.
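
Note that .strip() also discards the leading tab that the question wants to keep. One possible alternative, offered as a sketch rather than as part of the answer above, is to filter out the empty strings after splitting. The name split_keep is hypothetical.

```python
import re

def split_keep(s, pattern=r'(\s+)'):
    # re.split emits '' when the string starts or ends with a delimiter;
    # dropping those keeps the delimiters themselves in place.
    return [tok for tok in re.split(pattern, s) if tok]

print(split_keep("\tthis is an  example "))
# ['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example', ' ']
```

The pattern must contain a capturing group, otherwise re.split drops the delimiters as usual.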

Have you looked at pyparsing? Example borrowed from the pyparsing wiki:

>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
... 
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})

Thanks, guys, for pointing out the re module. I'm still trying to decide between that and using my own function that returns a sequence...

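def split_keep_delimiters(s, delims="\t\n\r "):
    # Yield alternating runs of delimiter and non-delimiter characters.
    if not s:
        return  # nothing to yield; s[0] below would raise IndexError
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        # A run ends when membership in delims flips.
        if delim_group != (char in delims):
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]  # the final run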

If I had time I'd benchmark them xD
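
For anyone who wants to run that comparison, here is a rough timeit sketch. It assumes Python 3, guards the generator against empty input, and uses an arbitrary sample string and repeat count. The names regex_tokens and WHITESPACE_RUNS are made up for illustration. No timings are claimed here, since they depend on the machine and the input.

```python
import re
import timeit

WHITESPACE_RUNS = re.compile(r'\s+|\S+')

def split_keep_delimiters(s, delims="\t\n\r "):
    # Hand-written generator: yields alternating delimiter/non-delimiter runs.
    if not s:
        return
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]

def regex_tokens(s):
    # Regex version: one findall call over the whole string.
    return WHITESPACE_RUNS.findall(s)

sample = "\tthis is an  example " * 1000  # arbitrary test input

# Sanity check: both approaches agree on this input.
assert list(split_keep_delimiters(sample)) == regex_tokens(sample)

candidates = [
    ("generator", lambda: list(split_keep_delimiters(sample))),
    ("regex", lambda: regex_tokens(sample)),
]
for name, func in candidates:
    seconds = timeit.timeit(func, number=20)
    print(f"{name}: {seconds:.4f} s for 20 runs")
```

The two are only equivalent for the four characters in delims: \s also matches form feeds, vertical tabs and Unicode whitespace, so results can differ on other inputs.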
