
Cleaned message containing only letters a-z and numbers 0-9, with only one space between words

Goal: a cleaned message that contains only the letters a-z and the numbers 0-9, with a single space between words.

import re

def clean_data(message):
    return " ".join("".join(re.findall("[a-zA-Z0-9_ ]", message)).lower().split())

sentence_1 = 'Doesn\'t get, how{to}% \\operate+66.7 :after[it]"" & lt;# & gt; won\'t `or(what)'
sentence_2 = 'O\]k,.lar7i$double{} check wif*& da! hair: [dresser;   ..already He SaID-77.88.5 wun cut v short question(std txt rate)T&C\'s'

Current output:

cleaned:  doesnt get howto operate667 afterit lt gt wont orwhat
cleaned:  oklar7idouble check wif da hair dresser already he said77885 wun cut v short questionstd txt ratetcs

Expected output:

cleaned:    doesn t get how to operate 66 7 after it lt gt won t or what
cleaned:    o k lar7i double check wif da hair dresser already he said 77 88 5 wun cut v short question std txt rate t c s

From what you have posted, we can deduce a few things:

  • The output should contain alphanumeric characters only
  • Everything that isn't an alphanumeric character gets replaced with a space
  • There should never be multiple spaces next to each other
  • The output is lowercase only
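
These rules can be turned into a quick check on any candidate output. The following is only a sketch: the helper name is_clean is hypothetical (it is not part of the original post), it assumes "alphanumeric" means ASCII a-z and 0-9, and it additionally rejects leading or trailing spaces.

```python
import re

# Hypothetical helper: one or more lowercase ASCII alphanumeric words,
# separated by exactly one space, with no leading or trailing space.
_CLEAN_PATTERN = re.compile(r"[a-z0-9]+(?: [a-z0-9]+)*")


def is_clean(text):
    """Return True if text follows the rules listed above."""
    return _CLEAN_PATTERN.fullmatch(text) is not None


print(is_clean("doesn t get how to operate 66 7"))  # True
print(is_clean("doesnt  get"))                      # False: two spaces in a row
print(is_clean("Doesn't get"))                      # False: uppercase and apostrophe
```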

Assuming that's the expected behavior, the code is very straightforward:

import re

def clean_data(message):
    return re.sub(r"[^\w]+", " ", message).lower()

  1. [^\w]+ matches each run of unwanted characters, and re.sub() replaces every such run with a single space.

  2. .lower() then converts everything to lowercase.
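
To see the two steps separately, here is a small sketch on a made-up input; the sample string and variable names are illustrative only and do not come from the original post.

```python
import re

sample = "How{to}%  operate+66.7"

# Step 1: each run of non-word characters becomes one space.
step_1 = re.sub(r"[^\w]+", " ", sample)
print(step_1)  # How to operate 66 7

# Step 2: lowercase the result.
step_2 = step_1.lower()
print(step_2)  # how to operate 66 7

# Without the "+" quantifier every unwanted character gets its own space,
# so a run of punctuation turns into several consecutive spaces.
no_plus = re.sub(r"[^\w]", " ", sample)
print(repr(no_plus))
```

The last call shows why the + matters: it is what keeps the output free of repeated spaces.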


sentence_1 = 'Doesn\'t get, how{to}% \\operate+66.7 :after[it]"" & lt;# & gt; won\'t `or(what)'
sentence_2 = 'O\]k,.lar7i$double{} check wif*& da! hair: [dresser;   ..already He SaID-77.88.5 wun cut v short question(std txt rate)T&C\'s'

print(clean_data(sentence_1))
print(clean_data(sentence_2))

>>> doesn t get how to operate 66 7 after it lt gt won t or what 
>>> o k lar7i double check wif da hair dresser already he said 77 88 5 wun cut v short question std txt rate t c s
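
Two caveats apply to \w in Python's re module: it also matches the underscore, and for str patterns it matches non-ASCII letters and digits unless re.ASCII is set. In addition, when the input starts or ends with a non-word character (as sentence_1 does), the result carries a leading or trailing space. If the requirement is strictly a-z and 0-9, one possible variant is sketched below; the name clean_data_strict is hypothetical.

```python
import re


def clean_data_strict(message):
    # Lowercase first so the class only needs a-z and 0-9; every run of
    # other characters (including "_" and non-ASCII letters) becomes one space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", message.lower())
    # Drop the space left behind by leading or trailing separators.
    return cleaned.strip()


print(clean_data_strict("won't `or(what)"))   # won t or what
print(clean_data_strict("snake_case café!"))  # snake case caf
```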
