How do I extract words that contain only letters from a text in Python?

Question

For example, in the following text:

"We’d love t0 help 123you, but the real1ty is th@t n0t every question gets answered. To improve your chances, here are some tips:"

how can I easily extract the words that contain only letters:

love, help, but,... To,... tips

I tried

import re

words = re.findall(r'^[a-zA-Z]+', str)  # ^ anchors the match to the start of the text
for word in words:
    print(word)

where str is the text. This does part of the job, but I need to adjust it somehow.

Any ideas on how to do this with a regular expression?

Answer 1

You can use a list comprehension.

s = "We’d love t0 help 123you, but the real1ty is th@t n0t every question gets answered. To improve your chances, here are some tips:"
print([i for i in s.split() if i.isalpha()])

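s.split() splits the input on whitespace.
Then simply iterate over the returned items and keep those that consist entirely of letters. Note that a token with punctuation attached, such as "tips:", is not purely alphabetic and is therefore left out.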
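
As a rough sketch of the approach above, the snippet below wraps the list comprehension in a function. The name letters_only and the strip_punctuation flag are made up for illustration and are not part of the original answer; the flag assumes you want to trim ASCII punctuation from both ends of each token before testing it, so that "tips:" is kept as "tips".

```python
import string


def letters_only(text, strip_punctuation=False):
    """Return whitespace-separated tokens made up entirely of letters."""
    tokens = text.split()
    if strip_punctuation:
        # Assumption: trim ASCII punctuation from both ends of each token first.
        tokens = [t.strip(string.punctuation) for t in tokens]
    # str.isalpha() is False for empty strings and for tokens with digits or symbols.
    return [t for t in tokens if t.isalpha()]


s = "We’d love t0 help 123you, but the real1ty is th@t n0t every question gets answered. To improve your chances, here are some tips:"

strict = letters_only(s)
relaxed = letters_only(s, strip_punctuation=True)
print(strict)
print(relaxed)
```

With the flag off, "answered.", "chances," and "tips:" are dropped; with it on, they come back without the trailing punctuation, while "th@t" and "We’d" are still rejected because the symbol sits inside the token.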

Answer 2

Use

re.findall(r'(?<!\S)[A-Za-z]+(?!\S)', x)
re.findall(r'\b[A-Za-z]+\b', x)

Or, with Unicode support:

re.findall(r'(?<!\S)[^\W\d_]+(?!\S)', x)
re.findall(r'\b[^\W\d_]+\b', x)

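See the regex proof.

Use (?<!\S) and (?!\S) to match words that are delimited by whitespace. If you also need words that sit between punctuation and whitespace, use \b instead.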
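
To make the difference between the two styles concrete, here is a small comparison on the text from the question. It is only a sketch; the variable names ws_delimited and word_bounded are illustrative.

```python
import re

s = "We’d love t0 help 123you, but the real1ty is th@t n0t every question gets answered. To improve your chances, here are some tips:"

# Whitespace-delimited: the run of letters must be surrounded by whitespace
# or by the start/end of the string.
ws_delimited = re.findall(r'(?<!\S)[A-Za-z]+(?!\S)', s)

# Word-boundary: the run of letters only needs a non-word character
# (or the string edge) on each side.
word_bounded = re.findall(r'\b[A-Za-z]+\b', s)

print(ws_delimited)
print(word_bounded)
```

The \b version picks up "answered", "chances" and "tips" despite the trailing punctuation, but it also returns fragments such as "We", "d", "th" and "t", because the apostrophe and "@" count as non-word characters. Which behaviour is preferable depends on how you want to treat tokens like "We’d" and "th@t".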

Explanation

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  [A-Za-z]+                any character of: 'A' to 'Z', 'a' to 'z'
                           (1 or more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  [^\W\d_]+                any character except: non-word characters
                           (all but a-z, A-Z, 0-9, _), digits (0-9),
                           '_' (1 or more times (matching the most
                           amount possible))
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-ahead
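
The following sketch contrasts the ASCII class with the Unicode-aware class from the explanation. It assumes Python 3, where \w in a str pattern also covers non-ASCII letters, so [^\W\d_] matches letters beyond a-z and A-Z. The sample text is made up for illustration.

```python
import re

# Hypothetical sample: "naïve café über42 plain", written with escapes so that
# each accented letter is a single precomposed code point.
text = "na\u00efve caf\u00e9 \u00fcber42 plain"

# ASCII letters only: any token containing an accented letter is rejected.
ascii_only = re.findall(r'(?<!\S)[A-Za-z]+(?!\S)', text)

# Word characters minus digits and underscore: accented letters are accepted.
any_letters = re.findall(r'(?<!\S)[^\W\d_]+(?!\S)', text)

print(ascii_only)
print(any_letters)
```

In both cases "über42" is rejected because of the digits; only the treatment of accented letters differs.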
