Python regex to find all words in a text

Question

This sounds very simple, I know, but for some reason I can't get all the results I need.

A word in this case is any run of non-whitespace characters delimited by whitespace. For example, for the string "Hello there stackoverflow." the result should be: ['Hello', 'there', 'stackoverflow.']

My code:

import re

word_pattern = "^\S*\s|\s\S*\s|\s\S*$"
result = re.findall(word_pattern,text)
print result

But after using this pattern on a string like the one shown, it only puts the first and the last words in the list, and not the words that have a space on both sides.

What is the problem with this pattern?
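
For reference, the behaviour described above can be reproduced with the short sketch below. The sample string and the variable names are assumptions for illustration; the pattern is the one from the question, written as a raw string. It suggests that the middle word is lost because `re.findall` returns non-overlapping matches and each alternative consumes the whitespace around the word.

```python
import re

# Sample input assumed for illustration (the example string from the question).
text = "Hello there stackoverflow."

# The pattern from the question, unchanged apart from the raw-string prefix.
word_pattern = r"^\S*\s|\s\S*\s|\s\S*$"

# findall returns non-overlapping matches. "Hello " consumes the space
# that the second alternative would need in order to match " there ".
result = re.findall(word_pattern, text)
print(result)  # ['Hello ', ' stackoverflow.']
```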

Answer 1

Use the \b boundary test instead:

r'\b\S+\b'

Result:

>>> import re
>>> re.findall(r'\b\S+\b', 'Hello there StackOverflow.')
['Hello', 'there', 'StackOverflow']

Or don't use a regular expression at all and just use .split(); the latter would include the punctuation in a sentence (the regex above did not match the . in the sentence).
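
A minimal side-by-side sketch of the two approaches mentioned above, using the same sample sentence; the variable names are only illustrative.

```python
import re

sentence = "Hello there StackOverflow."  # same sample as in the answer

# \b\S+\b ends each match at a word boundary, so the trailing "." is dropped.
with_regex = re.findall(r"\b\S+\b", sentence)

# str.split() splits on runs of whitespace and keeps punctuation attached.
with_split = sentence.split()

print(with_regex)  # ['Hello', 'there', 'StackOverflow']
print(with_split)  # ['Hello', 'there', 'StackOverflow.']
```

Which one is appropriate depends on whether the punctuation should count as part of the word, as it does in the question's expected output.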

Answer 2

To find all words in a string, it is best to use split:

>>> "Hello there stackoverflow.".split()
['Hello', 'there', 'stackoverflow.']

But if you must use regular expressions, then you should change your regex to something simpler and faster: r'\b\S+\b'.

r turns the string into a 'raw' string, meaning Python does not treat the backslashes as escape characters, so they reach the regex engine unchanged.
\b matches a word boundary: the zero-width position between a word character (\w) and a non-word character such as a space, newline, or punctuation, or the start or end of the string.
\S, as you probably know, matches any non-whitespace character.
+ means one or more of the preceding item.

So together it means: find every run of non-whitespace characters that starts and ends at a word boundary, which roughly gives you the words and numbers with the surrounding punctuation stripped.
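
The points about the raw-string prefix and \b can be checked with a small sketch. The sample text below is an assumption chosen to show punctuation at both ends of a word and an apostrophe inside one.

```python
import re

# Without the r prefix, "\b" is the single backspace character (0x08);
# with it, the regex engine receives the two characters backslash and b.
plain_len = len("\b")
raw_len = len(r"\b")
print(plain_len, raw_len)  # 1 2

# Sample text assumed for illustration.
sample = "Hello, (world)! don't"

# Leading and trailing punctuation falls outside the boundaries, while the
# apostrophe inside "don't" is kept because \S+ is greedy.
words = re.findall(r"\b\S+\b", sample)
print(words)  # ['Hello', 'world', "don't"]
```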

Answer 3

How about simply using:

>>> s = "Hello there stackoverflow."
>>> s.split()
['Hello', 'there', 'stackoverflow.']

Answer 4

The other answers are good. Depending on what you want (e.g. include/exclude punctuation or other non-word characters), an alternative could be to use a regex to split by one or more whitespace characters:

>>> import re
>>> re.split(r'\s+', 'Hello there   StackOverflow.')
['Hello', 'there', 'StackOverflow.']
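
One difference from str.split() is worth keeping in mind when the text has leading or trailing whitespace. The sketch below uses an assumed padded sample string to show it.

```python
import re

# Sample input assumed for illustration: note the leading and trailing spaces.
padded = "  Hello there   StackOverflow.  "

# re.split keeps the empty strings before the first and after the last separator.
regex_parts = re.split(r"\s+", padded)
print(regex_parts)  # ['', 'Hello', 'there', 'StackOverflow.', '']

# str.split() with no arguments discards them.
plain_parts = padded.split()
print(plain_parts)  # ['Hello', 'there', 'StackOverflow.']

# Stripping the input first makes the two agree.
stripped_parts = re.split(r"\s+", padded.strip())
print(stripped_parts)  # ['Hello', 'there', 'StackOverflow.']
```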

4 answers

Answer 1: 5, accepted, 2013-01-03 11:40:01
Answer 2: 2, 2013-01-03 11:39:51
Answer 3: 0, 2013-01-03 11:40:51
Answer 4: 0, 2013-01-03 11:45:35
