Cleaned message containing only letters a-z and digits 0-9, with a single space between words

Question

I need a cleaned message that contains only the letters a-z and the digits 0-9, with exactly one space between words.

import re

def clean_data(message):
    return " ".join("".join(re.findall("[a-zA-Z0-9_ ]", message)).lower().split())

sentence_1 = 'Doesn\'t get, how{to}% \\operate+66.7 :after[it]"" & lt;# & gt; won\'t `or(what)'
sentence_2 = 'O\]k,.lar7i$double{} check wif*& da! hair: [dresser;   ..already He SaID-77.88.5 wun cut v short question(std txt rate)T&C\'s'

Current output:

cleaned:  doesnt get howto operate667 afterit lt gt wont orwhat
cleaned:  oklar7idouble check wif da hair dresser already he said77885 wun cut v short questionstd txt ratetcs

Expected output:

cleaned:    doesn t get how to operate 66 7 after it lt gt won t or what
cleaned:    o k lar7i double check wif da hair dresser already he said 77 88 5 wun cut v short question std txt rate t c s

Answer 1

From what you have posted, we can deduce a few things:

The output should contain only alphanumeric characters.
Everything that isn't an alphanumeric character gets replaced with a space.
There should never be multiple spaces next to each other.
The output is lowercase only.

Assuming that's the expected behavior, the code is very straightforward:

import re

def clean_data(message):
    return re.sub(r"[^\w]+", " ", message).lower()

[^\w]+ matches each run of unwanted characters, and re.sub() replaces every such run with a single space.
.lower() then converts everything to lowercase.
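
To make the difference from the original attempt concrete, here is a minimal sketch on a short fragment of the first sentence. The variable names text, deleted and replaced are made up for this illustration. Deleting the unwanted characters fuses neighbouring words, while replacing each run with a space keeps them apart.

```python
import re

# Short fragment used only for illustration.
text = "how{to}% operate+66.7"

# Keep-only approach (as in the question): unwanted characters vanish,
# so the words on either side of them are glued together.
deleted = "".join(re.findall(r"[a-zA-Z0-9 ]", text))

# Replace approach (as in this answer): every run of unwanted
# characters becomes one space, so word boundaries survive.
replaced = re.sub(r"[^\w]+", " ", text)

print(deleted)   # howto operate667
print(replaced)  # how to operate 66 7
```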

sentence_1 = 'Doesn\'t get, how{to}% \\operate+66.7 :after[it]"" & lt;# & gt; won\'t `or(what)'
sentence_2 = 'O\]k,.lar7i$double{} check wif*& da! hair: [dresser;   ..already He SaID-77.88.5 wun cut v short question(std txt rate)T&C\'s'

print(clean_data(sentence_1))
print(clean_data(sentence_2))

>>> doesn t get how to operate 66 7 after it lt gt won t or what 
>>> o k lar7i double check wif da hair dresser already he said 77 88 5 wun cut v short question std txt rate t c s
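
One caveat worth knowing: \w also matches the underscore and, for str patterns, non-ASCII letters and digits, and a run of unwanted characters at the very start or end of the message leaves a space there (the first result above ends with one). If the requirement is strictly a-z and 0-9 with no surrounding spaces, a stricter variant might look like the sketch below. The name clean_data_strict is hypothetical, not part of the original answer.

```python
import re

def clean_data_strict(message):
    # Hypothetical stricter variant (an assumption, not the original answer's code).
    # Lowercase first so the character class only needs a-z and 0-9.
    lowered = message.lower()
    # Replace every run of characters outside a-z / 0-9 with one space,
    # then strip the space a leading or trailing run would leave behind.
    return re.sub(r"[^a-z0-9]+", " ", lowered).strip()

print(clean_data_strict("  Hello, snake_case World!  "))  # hello snake case world
```

Lowercasing before the substitution keeps the pattern short. The trade-off is that underscores and accented letters are now treated as separators rather than kept.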

Solution 1
0 Accepted 2022-09-05 11:04:59
