Using re.findall to strip a string down to words in Python

Question

I'm using re.findall like this:

x = re.findall(r'\w+', text)

so I expected to get a list of words made up of the characters [a-zA-Z0-9]. The problem appears when I use this input:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~:

I want to get an empty list, but I'm getting ['_', '_']. How can I exclude those underscores?
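The behaviour described above can be reproduced with a short sketch. It assumes Python 3 and writes the backslash in the input as `\\` so the string literal is valid; the variable name `words` is only illustrative.

```python
import re

# The punctuation-only input from the question (backslash escaped as \\).
text = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~:'

# \w matches letters, digits, and the underscore, so the two
# underscores in the input are returned as "words".
words = re.findall(r'\w+', text)
print(words)  # ['_', '_']
```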

Answer 1

Use just the [a-zA-Z0-9] pattern; \w includes underscores:

x = re.findall('[a-zA-Z0-9]+', text)

or use the inverse of \w, which is \W, in a negated character set with _ added:

x = re.findall(r'[^\W_]+', text)

The latter has the advantage of working correctly even when using re.UNICODE or re.LOCALE, where \w matches a wider range of characters.

Demo:

>>> import re
>>> text = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~:'
>>> re.findall(r'[^\W_]+', text)
[]
>>> re.findall(r'[^\W_]+', 'The foo bar baz! And the eggs, ham and spam?')
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']
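The point about \w matching a wider range of characters can be illustrated with a small sketch. It assumes Python 3, where str patterns are Unicode-aware by default (so re.UNICODE does not need to be passed explicitly); the sample string and variable names are made up for illustration.

```python
import re

# Hypothetical input with non-ASCII letters: 'café_naïve 42'
sample = 'caf\u00e9_na\u00efve 42'

# The explicit ASCII class splits words at every accented letter.
ascii_only = re.findall(r'[a-zA-Z0-9]+', sample)

# [^\W_] follows whatever \w means, minus the underscore,
# so accented letters stay inside their words.
unicode_aware = re.findall(r'[^\W_]+', sample)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_] again.
ascii_flag = re.findall(r'[^\W_]+', sample, flags=re.ASCII)

print(ascii_only)     # ['caf', 'na', 've', '42']
print(unicode_aware)  # ['café', 'naïve', '42']
print(ascii_flag)     # ['caf', 'na', 've', '42']
```

Which of the two patterns is preferable depends on whether non-ASCII letters should count as part of a word.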

Answer 2

You can use groupby for this too:

from itertools import groupby
x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]

For example:

>>> text = 'The foo bar baz! And the eggs, ham and spam?'
>>> x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]
>>> x
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']
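As a quick check that the groupby approach also handles the underscore case from the question, here is a minimal sketch. The helper name `split_alnum` is hypothetical, and the punctuation string is the question's input with the backslash escaped.

```python
from itertools import groupby


def split_alnum(text):
    """Return the maximal runs of alphanumeric characters in text."""
    # groupby yields (key, group) pairs for consecutive characters that
    # share the same str.isalnum result; keep only the True runs.
    return [''.join(group)
            for is_alnum, group in groupby(text, str.isalnum)
            if is_alnum]


punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

print(split_alnum(punctuation))    # []
print(split_alnum('foo_bar baz'))  # ['foo', 'bar', 'baz']
```

Because str.isalnum returns False for the underscore, it acts as a separator here, the same as any other punctuation character.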
