Regular expression to split words in Python

Question

I am designing a regular expression to split out all the actual words from a given text:

Input example:

"John's mom went there, but he wasn't there. So she said: 'Where are you'"

Expected output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]

I came up with a regex like this:

"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"

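After splitting with it in Python, the result contains None items and spaces.

How do I get rid of the None items? And why are the spaces not matched?

Edit:
Splitting on spaces gives items like: ["there."]
Splitting on non-letters gives items like: ["John","s"]
Splitting on non-letters other than ' gives items like: ["'Where","you'"]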
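
The three naive approaches from the edit can be reproduced with a short sketch. The variable names are illustrative and not part of the original post.

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# 1) Split on whitespace: punctuation stays attached to the words.
by_space = s.split()

# 2) Split on runs of non-letters: apostrophes inside words break them apart.
by_non_letters = [w for w in re.split(r"[^a-zA-Z]+", s) if w]

# 3) Split on runs of non-letters other than the apostrophe:
#    the quoting apostrophes stay attached to the words.
by_non_letters_keep_apostrophe = [w for w in re.split(r"[^a-zA-Z']+", s) if w]

print(by_space)
print(by_non_letters)
print(by_non_letters_keep_apostrophe)
```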

Answer 1

You can use string functions instead of a regex:

to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')
s.split()

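However, in your example you do not want to remove the apostrophe in John's, but you do want to remove it in you!!'. So string operations fail at this point, and you need a finely tuned regex.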
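
A minimal sketch of that limitation, assuming a hypothetical helper named remove_chars that wraps the loop above: leaving the apostrophe out of the removal set keeps the quoting apostrophes, while adding it also damages John's and wasn't.

```python
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

def remove_chars(text, chars):
    """Delete every occurrence of each character in `chars`, then split on whitespace."""
    for c in chars:
        text = text.replace(c, "")
    return text.split()

# Apostrophe not in the removal set: the quoting apostrophes survive.
kept = remove_chars(s, ".,:!")

# Apostrophe in the removal set: the apostrophes inside words are lost too.
stripped = remove_chars(s, ".,:!'")

print(kept)
print(stripped)
```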

Edit: Maybe a simple regex can solve your problem:

(\w[\w']*)

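It captures every run that starts with a word character (\w matches a letter, digit, or underscore) and keeps capturing while the next character is an apostrophe or a word character.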

(\w[\w']*\w)

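The second regex is for a very specific situation. The first regex can capture words like you'. This one avoids that and captures an apostrophe only when it is inside the word (not at the beginning or the end). But at that point a new problem appears: with the second regex you cannot capture the apostrophe in Moss' mom. You must decide whether to capture the trailing apostrophe in names that end with s and denote ownership.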
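
The trade-off between the two patterns can be sketched as follows. The sample strings and variable names are illustrative only.

```python
import re

first = re.compile(r"\w[\w']*")      # may end with an apostrophe
second = re.compile(r"\w[\w']*\w")   # must start and end with a word character

quoted = "'Where are you'"
possessive = "Moss' mom"

# The first pattern keeps the closing quote of the quoted sentence.
first_quoted = first.findall(quoted)
# The second pattern drops it, but it also drops the possessive apostrophe.
second_quoted = second.findall(quoted)
first_possessive = first.findall(possessive)
second_possessive = second.findall(possessive)

print(first_quoted, second_quoted)
print(first_possessive, second_possessive)
```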

Example:

import re
# Raw string so that \w reaches the regex engine unchanged
rgx = re.compile(r"([\w][\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

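Update 2: I found a bug in my regex! It cannot capture a single letter followed by an apostrophe, like A', because the pattern requires at least two word characters. The fixed, brand new regex is here: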

(\w[\w']*\w|\w)

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
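
A small sketch of the single-letter case, using an illustrative sample string that is not from the original post: the two-character pattern skips one-letter words, while the alternation falls back to a single word character.

```python
import re

two_or_more = re.compile(r"\w[\w']*\w")   # needs at least two word characters
fixed = re.compile(r"\w[\w']*\w|\w")      # falls back to one word character

sample = "'A a' I am"

missed = two_or_more.findall(sample)
caught = fixed.findall(sample)

print(missed)
print(caught)
```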

Answer 2

Your regex has too many capturing groups; make them non-capturing:

(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']

That returns only a single empty element, at the end of the list.
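
To see where the None items and the spaces come from, here is a short sketch comparing both patterns. re.split() adds one item per capturing group at every split point: the captured text, or None for a group that did not take part in the match. The spaces in the result are therefore matched separators that a group captured. The final filter is one possible way to drop the trailing empty string.

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
non_capturing = r"(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)"

# Four capturing groups: four extra items (text or None) per split point.
with_groups = re.split(capturing, s)

# Non-capturing groups: only the pieces between matches are returned.
# Filtering on truthiness removes the empty string at the end.
words = [w for w in re.split(non_capturing, s) if w]

print(with_groups[:6])
print(words)
```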

Answer 3

This regex allows only one trailing apostrophe, which may be followed by one more character:

([\w][\w]*'?\w?)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile(r"([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]
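
A brief sketch of how this pattern behaves at its edges. The sample strings are illustrative and not from the original post. Because only one character may follow the apostrophe, a contraction with a longer suffix is cut in two, and a trailing apostrophe is kept.

```python
import re

# At most one apostrophe, followed by at most one word character.
pattern = re.compile(r"\w\w*'?\w?")

short_suffix = pattern.findall("wasn't")    # one character after the apostrophe
long_suffix = pattern.findall("they're")    # two characters after the apostrophe
trailing = pattern.findall("Moss' mom")     # apostrophe at the end of a word

print(short_suffix, long_suffix, trailing)
```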

Answer 4

I am new to Python, but I think I have figured it out:

import re
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
result = re.findall(r"(.+?)[\s'\",!]{1,}", s)
print(result)

Result: ['John', 's', 'mom', 'went', 'there', 'but', 'he', 'wasn', 't', 'there.', 'So', 'she', 'said:', 'Where', 'are', 'you']

4 answers

Answer 1
25 accepted 2012-10-03 09:25:37

Answer 2
8 2012-10-03 09:14:56

Answer 3
2 2013-05-02 22:32:03

Answer 4
0 2021-05-14 10:58:53
