Python regex to find all words in text

Question

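I know this sounds simple, but for some reason I can't get all the results I need.

Here a word is any run of characters other than whitespace, delimited by spaces. For example, for the string "Hello there stackoverflow." the result should be: ['Hello', 'there', 'stackoverflow.']

My code: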

import re

text = "Hello there stackoverflow."  # sample string from above
word_pattern = r"^\S*\s|\s\S*\s|\s\S*$"
result = re.findall(word_pattern, text)
print(result)

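But when I use this pattern on a string like the one shown, the list contains only the first and last words, and not the words that sit between two spaces.

What is wrong with this pattern?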

Answer 1

Use the \b word-boundary assertion instead:

r'\b\S+\b'

Result:

>>> import re
>>> re.findall(r'\b\S+\b', 'Hello there StackOverflow.')
['Hello', 'there', 'StackOverflow']

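Alternatively, don't use a regular expression at all and just call .split(); that keeps punctuation attached to the words (the regex above does not match the . at the end of the sentence).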
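
To make the difference concrete, here is a minimal sketch that runs the question's pattern, the \b pattern, and str.split() on the same sample string. The variable names are illustrative only. re.findall returns non-overlapping matches, so the space consumed after "Hello" by the first alternative is no longer available to start a match for "there", which is why the middle word goes missing.

```python
import re

text = "Hello there stackoverflow."

# Pattern from the question: every alternative consumes the surrounding
# whitespace, so neighbouring words compete for the same space character.
original = re.findall(r"^\S*\s|\s\S*\s|\s\S*$", text)

# Word-boundary version: \b is zero-width, so no whitespace is consumed.
bounded = re.findall(r"\b\S+\b", text)

# Plain whitespace split: punctuation stays attached to the word.
split_words = text.split()

print(original)     # ['Hello ', ' stackoverflow.']
print(bounded)      # ['Hello', 'there', 'stackoverflow']
print(split_words)  # ['Hello', 'there', 'stackoverflow.']
```

Note that the matches from the original pattern also carry the whitespace they consumed.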

Answer 2

To find all the words in a string, it is best to use split:

>>> "Hello there stackoverflow.".split()
['Hello', 'there', 'stackoverflow.']

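But if you must use a regular expression, change it to a simpler, faster one: r'\b\S+\b'.

In other words, it finds every run of visible (non-whitespace) characters, such as words and numbers, that starts and ends at a word boundary.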
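
If the goal is the question's own definition of a word (any run of non-whitespace characters, punctuation included), one possible variation is to drop the word boundaries and match \S+ on its own. The sketch below is only an illustration under that assumption; the sample string with "--" and "(done)" is made up to show where the two patterns differ.

```python
import re

# Hypothetical sample with punctuation-only and parenthesised tokens.
text = "Hello there stackoverflow. -- (done)"

# \S+ alone keeps every whitespace-delimited token as-is.
tokens = re.findall(r"\S+", text)

# With \b on both sides, leading/trailing punctuation is trimmed and
# tokens made only of punctuation are skipped entirely.
bounded = re.findall(r"\b\S+\b", text)

print(tokens)   # ['Hello', 'there', 'stackoverflow.', '--', '(done)']
print(bounded)  # ['Hello', 'there', 'stackoverflow', 'done']
```

For this sample, re.findall(r"\S+", text) gives the same list as text.split().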

Answer 3

Simply use:

>>> s = "Hello there stackoverflow."
>>> s.split()
['Hello', 'there', 'stackoverflow.']

Answer 4

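The other answers are good. Depending on what you want (for example, whether to include or exclude punctuation and other non-word characters), another option is to use a regular expression to split on one or more whitespace characters: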

>>> re.split(r'\s+', 'Hello there   StackOverflow.')
['Hello', 'there', 'StackOverflow.']
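
One caveat worth checking before choosing re.split over str.split(): when the string starts or ends with whitespace, re.split returns empty strings at the edges, while str.split() with no arguments drops them. A small sketch, using a made-up padded string:

```python
import re

# Hypothetical input with leading and trailing whitespace.
padded = "  Hello there   StackOverflow. "

# The separators at the edges produce empty strings at both ends.
regex_parts = re.split(r"\s+", padded)

# str.split() with no arguments ignores leading/trailing whitespace.
plain_parts = padded.split()

# Stripping first makes re.split agree with str.split() here.
stripped_parts = re.split(r"\s+", padded.strip())

print(regex_parts)     # ['', 'Hello', 'there', 'StackOverflow.', '']
print(plain_parts)     # ['Hello', 'there', 'StackOverflow.']
print(stripped_parts)  # ['Hello', 'there', 'StackOverflow.']
```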