Regular expression to skip certain specific characters

Question

I am trying to clean a string so that it contains no punctuation or digits; it must contain only the letters a-z and A-Z. For example, the given string is:

"coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"

The required output is:

['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

My solution is:

re.findall(r"([A-Za-z]+)", string)

My output is:

['coMPuter', 'scien', 'tist', 's', 'are', 'the', 'rock', 'stars', 'of', 'tomorrow', 'cool']

Answer 1

You don't need to use a regular expression:

Convert the string to lower case (if you want all lower-cased words), split it into words, keep only the words that start with a letter, and strip the non-letter characters from each of them:

>>> s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
>>> [filter(str.isalpha, word) for word in s.lower().split() if word[0].isalpha()]
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

In Python 3.x, filter(str.isalpha, word) should be replaced with ''.join(filter(str.isalpha, word)), because in Python 3.x filter returns a filter object (a lazy iterator) rather than a string.
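For reference, the Python 3 form described above could look like the following sketch. The function name clean_words is only an illustrative choice, not something from the original answer. Note that str.isalpha also accepts non-ASCII letters, so this assumes the input is ASCII text, as in the example.

```python
s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"

def clean_words(text):
    """Hypothetical helper: lower-case, split on whitespace, keep words
    that start with a letter, and drop every non-letter character."""
    return [
        "".join(filter(str.isalpha, word))   # join the filter object back into a str
        for word in text.lower().split()
        if word[0].isalpha()                 # skips "<cool>" and "????"
    ]

print(clean_words(s))
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```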

Answer 2

Following the recommendations of everyone who answered, I arrived at the solution I really wanted. Thanks to everyone.

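import re
s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
# Remove <...> spans first, then any run of characters that are neither letters nor whitespace.
cleaned = re.sub(r'(<.*>|[^a-zA-Z\s]+)', '', s).split()
print(cleaned)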
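Two details of this approach may matter in practice: the result keeps the original letter case, and the greedy <.*> would swallow everything between the first < and the last > if a line contained more than one bracketed span. A possible variant, offered only as a sketch (the names TAG_OR_NON_LETTER and strip_and_split are made up for illustration), uses a non-greedy <.*?> and lower-cases the result:

```python
import re

# Hypothetical names; the pattern is the one above with a non-greedy <.*?>.
TAG_OR_NON_LETTER = re.compile(r"<.*?>|[^a-zA-Z\s]+")

def strip_and_split(text):
    """Drop <...> spans and non-letter runs, lower-case, split on whitespace."""
    return TAG_OR_NON_LETTER.sub("", text).lower().split()

s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
print(strip_and_split(s))
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```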

Answer 3

Using re, although I'm not sure this is what you want, because you said you didn't want "cool" left over.

import re

s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"

REGEX = r'([^a-zA-Z\s]+)'

cleaned = re.sub(REGEX, '', s).split()
# ['coMPuter', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow', 'cool']

EDIT

WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
CLEAN_REGEX = re.compile(r'([^a-zA-Z])')

def cleaned(match_obj):
    return re.sub(CLEAN_REGEX, '', match_obj.group(1)).lower()

[cleaned(x) for x in re.finditer(WORD_REGEX, s)]
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

WORD_REGEX uses a positive lookahead for a word character and a negative lookahead for <...>. Whatever non-whitespace makes it past the lookaheads is captured in a group:

(?!<?\S+>) # negative lookahead
(?=\w) # positive lookahead
(\S+) #group non-whitespace

cleaned takes each match object, removes every non-letter character from its captured group with CLEAN_REGEX, and lower-cases the result.
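The commented breakdown above can also be kept inside the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. The following is a sketch of the same two regexes written that way; the variable name words is an illustrative choice.

```python
import re

WORD_REGEX = re.compile(r"""
    (?!<?\S+>)   # negative lookahead: reject text that runs up to a closing >
    (?=\w)       # positive lookahead: the token must start with a word character
    (\S+)        # group 1: the run of non-whitespace characters
""", re.VERBOSE)
CLEAN_REGEX = re.compile(r"[^a-zA-Z]")

def cleaned(match_obj):
    # Strip everything that is not an ASCII letter, then lower-case.
    return CLEAN_REGEX.sub("", match_obj.group(1)).lower()

s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
words = [cleaned(m) for m in WORD_REGEX.finditer(s)]
print(words)
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```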

3 answers

Answer 1
5 Accepted 2017-03-04 03:50:12

Answer 2
3 2017-03-04 04:53:41

Answer 3
1 2017-03-04 04:11:06
