Using re.findall to split a string into words in Python
I'm using re.findall like this:
x=re.findall('\w+', text)
So I'm getting a list of words matching the characters [a-zA-Z0-9]. The problem is when I use this input:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~:
I want to get an empty list, but I'm getting ['_', '_']. How can I exclude those underscores?
Use just the [a-zA-Z0-9] pattern; \w includes underscores:
x = re.findall('[a-zA-Z0-9]+', text)
Or use the inverse of \w, \W, in a negated character class with _ added:
x = re.findall(r'[^\W_]+', text)
The latter has the advantage of working correctly even when using re.UNICODE or re.LOCALE, where \w matches a wider range of characters.
Demo:
>>> import re
>>> text = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~:'
>>> re.findall(r'[^\W_]+', text)
[]
>>> re.findall(r'[^\W_]+', 'The foo bar baz! And the eggs, ham and spam?')
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']
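To illustrate the point about \w matching a wider range of characters, here is a small sketch comparing the two patterns. The sample string is made up for illustration, and this assumes Python 3, where str patterns match Unicode by default:

```python
import re

# Hypothetical sample: "café_naïve 123", written with escapes so the
# accented letters are unambiguous single code points.
sample = 'caf\u00e9_na\u00efve 123'

# ASCII-only class: accented letters split the words apart.
ascii_words = re.findall(r'[a-zA-Z0-9]+', sample)

# Negated class: every word character except the underscore.
unicode_words = re.findall(r'[^\W_]+', sample)

print(ascii_words)    # ['caf', 'na', 've', '123']
print(unicode_words)  # ['café', 'naïve', '123']
```

Both patterns drop the underscore; they differ only in how they treat non-ASCII letters, so which one is appropriate depends on the text being processed.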
You can use groupby for this too:
from itertools import groupby
x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]
For example:
>>> text = 'The foo bar baz! And the eggs, ham and spam?'
>>> x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]
>>> x
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']
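As a quick check that this approach also handles the punctuation-only input from the question, here is a minimal sketch. The helper name split_words is only for illustration and is not part of the answer above:

```python
from itertools import groupby

def split_words(text):
    """Return the runs of consecutive alphanumeric characters in text."""
    # groupby splits text into runs that share the same str.isalnum result;
    # only the runs where that result is True are kept.
    return [''.join(group)
            for is_alnum, group in groupby(text, str.isalnum)
            if is_alnum]

punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

print(split_words(punctuation * 2 + ':'))  # []
print(split_words('foo_bar baz'))          # ['foo', 'bar', 'baz']
```

Because str.isalnum returns False for the underscore, underscores act as separators here, just as with the [^\W_]+ pattern.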