How to extract words containing only letters from a text in Python?
For example, in the following text:
"We’d love t0 help 123you, but the real1ty is th@t n0t every question gets answered. To improve your chances, here are some tips:"
How can I easily extract the words containing only letters:
love, help, but,... To,... tips
I tried:
import re
words = re.findall(r'^[a-zA-Z]+', str)
for word in words:
    print(word)
where str is the text. This partly works, but I need to tweak it somehow.
Any ideas on how to do it with regular expressions?
You may use a list comprehension.
s = "We’d love t0 help 123you, but the real1ty is th@t n0t every question gets answered. To improve your chances, here are some tips:"
print([i for i in s.split() if i.isalpha()])
s.split() splits the input on whitespace.
Use
re.findall(r'(?<!\S)[A-Za-z]+(?!\S)', x)
re.findall(r'\b[A-Za-z]+\b', x)
Or, with Unicode support:
re.findall(r'(?<!\S)[^\W\d_]+(?!\S)', x)
re.findall(r'\b[^\W\d_]+\b', x)
See regex proof.
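To illustrate the difference between the ASCII and Unicode variants, here is a minimal sketch. The sample string is made up for this illustration, and it assumes Python 3, where str patterns are Unicode-aware by default.

```python
import re

# Hypothetical sample string, not taken from the question.
sample = "hello Straße Москва snake_case 3D"

# ASCII letters only: a word containing any non-ASCII letter is rejected,
# because \b does not occur between two word characters.
ascii_only = re.findall(r'\b[A-Za-z]+\b', sample)

# [^\W\d_] means "a word character that is neither a digit nor '_'",
# which leaves letters in any script.
unicode_letters = re.findall(r'\b[^\W\d_]+\b', sample)

print(ascii_only)       # ['hello']
print(unicode_letters)  # ['hello', 'Straße', 'Москва']
```

Both variants skip snake_case and 3D, since the underscore and the digit are word characters and so no \b occurs next to the letters.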
Use (?<!\S) and (?!\S) to match words delimited by whitespace. Use \b if you also need words that sit between punctuation and whitespace.
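As a rough comparison of the two approaches, the sketch below runs both ASCII patterns on the text from the question. The variable names are only illustrative.

```python
import re

text = ("We’d love t0 help 123you, but the real1ty is th@t n0t every "
        "question gets answered. To improve your chances, here are some tips:")

# Whitespace-delimited: the whole token between spaces must be letters.
strict = re.findall(r'(?<!\S)[A-Za-z]+(?!\S)', text)

# Word-boundary version: punctuation next to a word is tolerated.
loose = re.findall(r'\b[A-Za-z]+\b', text)

print(strict)
# ['love', 'help', 'but', 'the', 'is', 'every', 'question', 'gets',
#  'To', 'improve', 'your', 'here', 'are', 'some']
print(loose)
# ['We', 'd', 'love', 'help', 'but', 'the', 'is', 'th', 't', 'every',
#  'question', 'gets', 'answered', 'To', 'improve', 'your', 'chances',
#  'here', 'are', 'some', 'tips']
```

Note the trade-off: the \b version picks up answered, chances and tips, but it also returns fragments such as We, d, th and t, because ’ and @ are not word characters and therefore create boundaries. The whitespace version drops any token that touches punctuation.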
EXPLANATION
--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  [A-Za-z]+                any character of: 'A' to 'Z', 'a' to 'z'
                           (1 or more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  [^\W\d_]+                any character except: non-word characters
                           (all but a-z, A-Z, 0-9, _), digits (0-9),
                           '_' (1 or more times (matching the most
                           amount possible))
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-ahead