Cleaned message, which contains only letters a-z and numbers 0-9, with only one space between words
import re

def clean_data(message):
    # Keeps letters, digits, "_" and spaces; every other character is deleted outright.
    return " ".join("".join(re.findall("[a-zA-Z0-9_ ]", message)).lower().split())
sentence_1 = 'Doesn\'t get, how{to}% \\operate+66.7 :after[it]"" & lt;# & gt; won\'t `or(what)'
sentence_2 = 'O\]k,.lar7i$double{} check wif*& da! hair: [dresser; ..already He SaID-77.88.5 wun cut v short question(std txt rate)T&C\'s'
Current output:
cleaned: doesnt get howto operate667 afterit lt gt wont orwhat
cleaned: oklar7idouble check wif da hair dresser already he said77885 wun cut v short questionstd txt ratetcs
Expected Output:
cleaned: doesn t get how to operate 66 7 after it lt gt won t or what
cleaned: o k lar7i double check wif da hair dresser already he said 77 88 5 wun cut v short question std txt rate t c s
From what you have posted, we can deduce a couple of things:
Assuming that's the expected behavior, the code is very straightforward:
import re

def clean_data(message):
    # Replace each run of non-word characters with a single space, then lowercase.
    return re.sub(r"[^\w]+", " ", message).lower()
The pattern [^\w]+ matches each run of unwanted characters, and re.sub() replaces every such run with a single space.
We convert everything to lowercase with .lower().
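To make the two steps visible, here is a minimal sketch on a short made-up string (the `sample` value is only an illustration, not taken from the question). It lists the runs that [^\w]+ matches and then shows what is left once each run becomes one space.

```python
import re

# Hypothetical input, chosen only to illustrate the pattern.
sample = "Won't `or(what)"

# Each element is one run of consecutive non-word characters.
runs = re.findall(r"[^\w]+", sample)
print(runs)    # ["'", ' `', '(', ')']

# Every run is replaced by exactly one space, then the text is lowercased.
result = re.sub(r"[^\w]+", " ", sample).lower()
print(repr(result))    # 'won t or what '
```

Note the trailing space in the result: a run of unwanted characters at the very start or end of the message also turns into a space.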
sentence_1 = 'Doesn\'t get, how{to}% \\operate+66.7 :after[it]"" & lt;# & gt; won\'t `or(what)'
sentence_2 = 'O\]k,.lar7i$double{} check wif*& da! hair: [dresser; ..already He SaID-77.88.5 wun cut v short question(std txt rate)T&C\'s'
print(clean_data(sentence_1))
print(clean_data(sentence_2))
>>> doesn t get how to operate 66 7 after it lt gt won t or what
>>> o k lar7i double check wif da hair dresser already he said 77 88 5 wun cut v short question std txt rate t c s
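One caveat, in case the requirement in the title is meant literally: \w also matches the underscore and, for str patterns in Python 3, non-ASCII letters and digits, and a message that begins or ends with punctuation keeps a leading or trailing space. A possible stricter variant is sketched below; the name `clean_data_strict` and the test string are made up for illustration and are not part of the original answer.

```python
import re

def clean_data_strict(message):
    # Lowercase first so the character class only needs a-z and 0-9.
    lowered = message.lower()
    # Any run of other characters (punctuation, "_", non-ASCII letters)
    # becomes a single space.
    spaced = re.sub(r"[^a-z0-9]+", " ", lowered)
    # Drop the space left behind by leading or trailing junk.
    return spaced.strip()

print(clean_data_strict("  Hello__World, café!  "))    # hello world caf
```

This drops accented letters such as "é" entirely, which matches "only letters a-z" but may not be what you want for non-English text.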