How do I split a sentence into words with a regular expression?

Question

"She's so nice!" -> ["she","'","s","so","nice","!"] I want to split a sentence like this, so I wrote the code below, but the result includes whitespace. How can I do this using only a regular expression?

        import re
        words = re.findall(r'\W+|\w+', "She's so nice!")  # the original call omitted the input string

-> ["she", "'","s", " ", "so", " ", "nice", "!"]

        words = [word for word in words if not word.isspace()]

Answer 1

Regex: [A-Za-z]+|[^A-Za-z ]

Inside [^A-Za-z ], add any other characters you don't want to match.

Details:

[] Match a single character present in the list
[^] Match a single character NOT present in the list
+ Match between one and unlimited times
| Or
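As a small illustration of the pieces listed above, the two alternatives can be run separately before they are joined with |. This is only a sketch; the variable names letters and others are made up for the example.

```python
import re

text = "She's so nice!"

# First alternative: runs of ASCII letters only.
letters = re.findall(r"[A-Za-z]+", text)

# Second alternative: any single character that is neither
# an ASCII letter nor a space.
others = re.findall(r"[^A-Za-z ]", text)

print(letters)  # ['She', 's', 'so', 'nice']
print(others)   # ["'", '!']
```

Joining the two with | lets findall pick whichever alternative matches at each position, so letters and punctuation come back in their original order.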

Python code:

import re

text = "She's so nice!"
matches = re.findall(r'[A-Za-z]+|[^A-Za-z ]', text)

Output:

['She', "'", 's', 'so', 'nice', '!']
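The note about adding characters to the negated class could look like the sketch below, which also drops tabs and newlines. The helper split_words and its skip parameter are hypothetical names introduced for this example; they are not part of the answer.

```python
import re

def split_words(text, skip=" \t\n"):
    # `skip` (hypothetical parameter): characters to leave out of the result.
    # re.escape keeps characters such as ']' or '^' literal inside the class.
    pattern = r"[A-Za-z]+|[^A-Za-z" + re.escape(skip) + "]"
    return re.findall(pattern, text)

tokens = split_words("She's\tso nice!\n")
print(tokens)  # ['She', "'", 's', 'so', 'nice', '!']
```

Note that [A-Za-z] only covers ASCII letters, so digits and accented letters are returned one character at a time with this pattern.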


Answer 2

Before Python 3.7, the re module's split() did not split on zero-width matches. You can use the regex package from PyPI instead (making sure to specify version 1, which properly handles zero-width matches).


import regex

s = "She's so nice!"
x = regex.split(r"\s+|\b(?!^|$)", s, flags=regex.VERSION1)

print(x)

Output: ['She', "'", 's', 'so', 'nice', '!']

\s+|\b(?!^|$) Match either of the following options
- \s+ Match one or more whitespace characters
- \b(?!^|$) Assert position as a word boundary, but not at the beginning or end of the line
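To see where the zero-width alternative of the pattern can match, the following sketch lists those positions using only the standard re module. It inspects the boundary part alone and does not reproduce the regex.split call; the names boundary and positions are illustrative.

```python
import re

s = "She's so nice!"

# Zero-width alternative from the pattern: a word boundary that is
# neither at the start nor at the end of the string.
boundary = re.compile(r"\b(?!^|$)")

# finditer reports empty matches, so the candidate positions can be listed.
positions = [m.start() for m in boundary.finditer(s)]
print(positions)  # [3, 4, 5, 6, 8, 9, 13]
```

Positions 3, 4 and 13 surround the apostrophe and precede the exclamation mark, which is why those characters end up as separate items. Positions 5 and 8 are where the spaces begin, and there the \s+ alternative is tried first because it is listed first in the alternation.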

2 answers

Answer 1: 2, accepted, 2018-02-12 18:30:14

Answer 2: 0, 2018-02-12 18:34:45
