Regex to extract ONLY alphanumeric words

I am looking for a regex that extracts only the words consisting entirely of alphanumeric characters:

string = 'This is a $dollar sign !!'
matches = re.findall(regex, string)
matches = ['This', 'is', 'sign']

This can be done by tokenizing the string and evaluating each token individually with the following regex:

^[a-zA-Z0-9]+$
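As a minimal sketch of that tokenize-and-test baseline, assuming tokens are separated by whitespace (the helper name `alnum_tokens` is hypothetical, not from the question):

```python
import re

# Pattern from the question: the whole token must be ASCII letters or digits.
TOKEN_RE = re.compile(r'^[a-zA-Z0-9]+$')

def alnum_tokens(text):
    """Split on whitespace, then keep only fully alphanumeric tokens."""
    return [tok for tok in text.split() if TOKEN_RE.match(tok)]

print(alnum_tokens('This is a $dollar sign !!'))
```

Note that the single-letter token 'a' also passes this pattern, so it shows up in the result alongside 'This', 'is' and 'sign'.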

Due to performance issues, I want to be able to extract the alphanumeric tokens without tokenizing the whole string. The closest I got was

regex = r'\b[a-zA-Z0-9]+\b'

but it still extracts alphanumeric substrings from tokens that contain other characters:

string = 'This is a $dollar sign !!'
matches = re.findall(regex, string)
matches = ['This', 'is', 'dollar', 'sign']

Is there a regex able to pull this off? I've tried different things but can't come up with a solution.
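The `\b` attempt in the question picks up `dollar` because `\b` matches at any transition between a word character and a non-word character, and `$` is a non-word character. A small check that prints each match with its start offset, as a sketch:

```python
import re

text = 'This is a $dollar sign !!'
pattern = re.compile(r'\b[a-zA-Z0-9]+\b')

# '$' is a non-word character, so \b matches between '$' and 'd',
# and 'dollar' is returned even though the full token is '$dollar'.
for m in pattern.finditer(text):
    print(m.start(), m.group())
```

The match for `dollar` starts one character after the `$`, which shows that the boundary falls inside the token rather than at its edge.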

Instead of word boundaries, use a lookbehind and a lookahead for spaces (or the beginning/end of the string):

(?:^|(?<= ))[a-zA-Z0-9]+(?= |$)

https://regex101.com/r/TZ7q1c/1

Note that "a" is a standalone alphanumeric word, so it's included too.

['This', 'is', 'a', 'sign']
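A runnable sketch of the lookaround pattern with `re.findall`. The second pattern, `WS_PATTERN`, is not from the answer: it is a common variant offered here as an assumption for input where tokens may also be separated by tabs or newlines.

```python
import re

# Pattern from the answer: a run of ASCII letters/digits that starts at the
# beginning of the string or right after a space, and ends right before a
# space or at the end of the string.
SPACE_PATTERN = re.compile(r'(?:^|(?<= ))[a-zA-Z0-9]+(?= |$)')

# Assumed variant: "not preceded and not followed by a non-whitespace
# character", which treats any whitespace as a separator.
WS_PATTERN = re.compile(r'(?<!\S)[a-zA-Z0-9]+(?!\S)')

text = 'This is a $dollar sign !!'
print(SPACE_PATTERN.findall(text))
print(WS_PATTERN.findall(text))
print(WS_PATTERN.findall('one\ttwo$ three\n'))
```

Both patterns scan the string once and need no separate tokenizing step, which is what the question asked for.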

There is no need to use a regex for this; Python has a built-in isalnum string method. See below:

string = 'This is a $dollar sign !!'

matches = [word for word in string.split(' ') if word.isalnum()]

[Edited thanks to Khabz's comment. I misunderstood the question.]
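One caveat with the `isalnum` approach: `str.isalnum()` also returns True for non-ASCII letters and digits, so it is broader than `[a-zA-Z0-9]`. A sketch that narrows it to ASCII, assuming Python 3.7+ for `str.isascii()` (the function name `ascii_alnum_words` is hypothetical):

```python
def ascii_alnum_words(text):
    """Keep whitespace-separated tokens made only of ASCII letters and digits."""
    # str.isalnum() alone also accepts non-ASCII letters and digits,
    # so str.isascii() narrows the test to the [a-zA-Z0-9] range.
    return [w for w in text.split() if w.isascii() and w.isalnum()]

print(ascii_alnum_words('This is a $dollar sign !!'))
print(ascii_alnum_words('café 123 naïve ok'))
```

Using `split()` with no argument also handles tabs, newlines and repeated spaces, whereas `split(' ')` splits on single spaces only.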

Depending on your intention, you could also "split" instead of "match".

>>> matches = re.split(r'(?:\s*\S*[\$\!]+\S*\s*|\s+)', string)
['This', 'is', 'a', 'sign', '']

And in case you need to remove the leading or trailing empty strings:

>>> matches = [x for x in re.split(r'(?:\s*\S*[\$\!]+\S*\s*|\s+)', string) if x]
['This', 'is', 'a', 'sign']

CertainPerformance's response using lookbehind and lookahead is the most compact. Using split is sometimes advantageous when the exclusion is what is specified, i.e., when the regex describes what needs to be excluded, as the one above does. In this case, however, it is the inclusion of alphanumeric characters that is specified, so using split() is not a good idea.
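To illustrate that last point, here is a sketch that wraps the split pattern in a helper (the name `split_words` is hypothetical). Because the pattern only lists `$` and `!` as characters to exclude, a token containing any other symbol passes through:

```python
import re

# Split pattern from the answer: it consumes tokens containing '$' or '!'.
SPLIT_RE = re.compile(r'(?:\s*\S*[\$\!]+\S*\s*|\s+)')

def split_words(text):
    """Split on the exclusion pattern and drop empty strings."""
    return [x for x in SPLIT_RE.split(text) if x]

print(split_words('This is a $dollar sign !!'))
# '#' is not in the exclusion set, so '#hash' is kept as a token.
print(split_words('tag #hash ok'))
```

Every additional symbol would have to be added to the character class by hand, whereas an inclusion pattern such as `[a-zA-Z0-9]+` states the requirement directly.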
