How can I split at word boundaries with regexes?

I'm trying to do this:

import re
sentence = "How are you?"
print(re.split(r'\b', sentence))

The result is:

[u'How are you?']

I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?

Unfortunately, before Python 3.7, re.split could not split on empty matches, and \b only ever matches the empty string.

To get around this, you would need to use findall instead of split.
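For readers on newer interpreters: from Python 3.7 onward, re.split does accept patterns that can match the empty string, so the original attempt produces pieces that only need filtering. The following is a minimal sketch under that version assumption; the names `pieces` and `tokens` are illustrative only.

```python
import re

sentence = "How are you?"

# Assumes Python 3.7+, where re.split may split on empty matches such as \b.
pieces = re.split(r"\b", sentence)

# Drop the empty and whitespace-only pieces, trimming stray spaces.
tokens = [p.strip() for p in pieces if p.strip()]
print(tokens)  # ['How', 'are', 'you', '?'] on Python 3.7+
```

On older versions the findall approach remains the safer choice.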

Actually, \b just means a word boundary.

It is roughly equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w), except that the lookaround form does not match at the start or end of the string.
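One way to check how closely the two patterns agree is to list the offsets where each one matches. This is a small sketch, not part of the original answer; `match_positions` is a hypothetical helper introduced here for illustration.

```python
import re

sentence = "How are you?"

def match_positions(pattern, text):
    """Return the start offset of every (possibly empty) match."""
    return [m.start() for m in re.finditer(pattern, text)]

word_boundary = match_positions(r"\b", sentence)
lookarounds = match_positions(r"(?<=\w)(?=\W)|(?<=\W)(?=\w)", sentence)

print(word_boundary)  # [0, 3, 4, 7, 8, 11]
print(lookarounds)    # [3, 4, 7, 8, 11]
```

The lookaround form needs a character on both sides, so it misses the boundary at offset 0, where the string starts with a word character.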

That means the following code would work:

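import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
# Result: ['How', ' ', 'are', ' ', 'you', '?']

Another option is to match words and punctuation marks directly:

import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)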

Output:

['How', 'are', 'you', '?']

Regex Explanation:

"[\w']+|[.,!?;]"

    1st Alternative: [\w']+
        [\w']+ match a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
            \w match any word character [a-zA-Z0-9_]
            ' the literal character '
    2nd Alternative: [.,!?;]
        [.,!?;] match a single character present in the list below
            .,!?; a single character in the list .,!?; literally
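To see what the two alternatives do on other inputs, here is a short sketch using sample sentences that are not from the original answer. It suggests that apostrophes stay inside words, while any character outside both classes (a colon, for example) is silently dropped.

```python
import re

pattern = r"[\w']+|[.,!?;]"

# Apostrophes are part of the first character class, so contractions stay whole.
contractions = re.findall(pattern, "Don't panic, it's fine!")
print(contractions)  # ["Don't", 'panic', ',', "it's", 'fine', '!']

# ':' is in neither class, so it is skipped just like whitespace.
with_colon = re.findall(pattern, "Wait: really?")
print(with_colon)  # ['Wait', 'really', '?']
```

If other punctuation should be kept, the second character class would need to list it.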

Here is my approach to split on word boundaries:

import re
re.split(r"\b\W\b", "How are you?") # Reprocess list to split on special characters.
# Result: ['How', 'are', 'you?']

And using findall on word boundaries:

re.findall(r"\b\w+\b", "How are you?")
# Result: ['How', 'are', 'you']
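Neither variant above returns the '?' as its own item. One possible way to get the list asked for in the question in a single pass is sketched below; `tokenize` is a hypothetical helper name, and the pattern is an assumption about what should count as a token (runs of word characters, plus single non-space symbols).

```python
import re

def tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ grabs a whole word; [^\w\s] grabs one character that is
    # neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("How are you?"))  # ['How', 'are', 'you', '?']
```

Whitespace matches neither alternative, so it is skipped without any post-processing.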
