Regular expression findall() in Python

Question

Suppose I have this string:

s = "this, that; talk, love, hate; good, bad, all good."

I want to extract the items that are separated by a comma, a semicolon, or a period.

So the result I want is:

["this", "that", "talk", "love", "hate", "good", "bad", "all good"]

If I use this Python regular expression:

re.findall(r"([a-z]+[,;.])+", s)

I get this result:

['this,', 'that;', 'talk,', 'love,', 'hate;', 'good,', 'bad,', 'good.']

That is close to what I want, except for the last item.

Strangely, if I include a space in the first character class (the first pair of square brackets), like this:

re.findall(r"([a-z ]+[,;.])+", s)

then I only get the following result:

[' all good.']

But findall() is supposed to find all matches, isn't it? Can someone explain this strange behavior?

Answer 1

Your goal is to split the string into tokens at delimiters, so re.split() is a better fit than re.findall(). In this case, you can use:

>>> re.split(r"[,;.]\s", s)
['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good.']

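Unfortunately, this approach has a flaw either way: with [,;.]\s as the regular expression, the period stays attached to the end of the last item, while with [,;.]\s? an empty string is appended to the end of the result list. However, we can handle the latter by dropping the last string: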

>>> re.split(r"[,;.]\s?", s)[:-1]
['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good']
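
As a self-contained sketch of the same idea: slicing with [:-1] assumes the input always ends with a separator. Filtering out empty strings instead avoids dropping a real item when it does not. The helper name split_items is hypothetical, not part of the re module.

```python
import re

s = "this, that; talk, love, hate; good, bad, all good."

def split_items(text):
    # Split on one separator plus any whitespace that follows it,
    # then discard empty strings (such as the one after a trailing ".").
    return [item for item in re.split(r"[,;.]\s*", text) if item]

print(split_items(s))
```

With this variant, an input such as "a, b" (no trailing period) still yields both items.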

Answer 2

You can use a lookahead:

>>> list(re.findall(r"([a-z][a-z ]+(?=[,;.]))+", s))
['this', 'that', 'talk', 'love', 'hate', 'good', 'bad', 'all good']

But re.split(), as recommended by @murgatroid99, is better.

Answer 3

You can use:

re.findall(r'[\w\s]+', s)
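
One caveat with this pattern: \s also matches the space that follows each separator, so most matches come back with a leading space. A minimal sketch of stripping them, assuming surrounding whitespace is never meaningful in the items:

```python
import re

s = "this, that; talk, love, hate; good, bad, all good."

# Each match is a run of word characters and whitespace,
# so items after a separator start with a space, e.g. " that".
raw = re.findall(r"[\w\s]+", s)

# Strip surrounding whitespace and skip whitespace-only matches.
items = [m.strip() for m in raw if m.strip()]
print(items)
```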

Answer 4

The + (just before the closing quote) is outside the parentheses. Move it inside, like so:

re.findall(r"\s*([a-z ]+)[ ,;.]+", s)
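
To see why the original pattern returned a single item, it may help to look at one match object directly. When a pattern has exactly one capturing group, findall() returns that group's text rather than the whole match, and a group that repeats keeps only its last repetition. The sketch below uses the pattern from the question; the variable names are just for illustration.

```python
import re

s = "this, that; talk, love, hate; good, bad, all good."

# Once the group can match spaces, every "item + separator" chunk is
# adjacent to the next, so the outer + consumes the string in one match.
m = re.search(r"([a-z ]+[,;.])+", s)
whole_match = m.group(0)   # the entire string
last_capture = m.group(1)  # only the final repetition of the group

# With the + inside the group, each item is a separate match.
fixed = re.findall(r"\s*([a-z ]+)[ ,;.]+", s)

print(repr(whole_match))
print(repr(last_capture))
print(fixed)
```

So findall() did find every match. There was only one match, and its single group held the last item.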