
Cleaned message, which contains only letters a-z and numbers 0-9, with only one space

I want a cleaned message that contains only the letters a-z and the digits 0-9, with exactly one space between words.

import re

def clean_data(message):
    return " ".join("".join(re.findall("[a-zA-Z0-9_ ]", message)).lower().split())

sentence_1 = 'Doesn\'t get, how{to}% \\operate+66.7 :after[it]"" & lt;# & gt; won\'t `or(what)'
sentence_2 = 'O\]k,.lar7i$double{} check wif*& da! hair: [dresser;   ..already He SaID-77.88.5 wun cut v short question(std txt rate)T&C\'s'

Current output:

cleaned:  doesnt get howto operate667 afterit lt gt wont orwhat
cleaned:  oklar7idouble check wif da hair dresser already he said77885 wun cut v short questionstd txt ratetcs

Expected output:

cleaned:    doesn t get how to operate 66 7 after it lt gt won t or what
cleaned:    o k lar7i double check wif da hair dresser already he said 77 88 5 wun cut v short question std txt rate t c s

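From what you posted, we can infer the following:

  • The output should contain only alphanumeric characters
  • Everything that is not an alphanumeric character is replaced with a space
  • There should never be multiple adjacent spaces
  • The output is lowercase only

Assuming this is the expected behavior, the code is quite simple: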

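import re

def clean_data(message):
    return re.sub(r"[^\w]+", " ", message).lower()
  1. [^\w]+ matches each run of unwanted (non-word) characters, and re.sub() replaces every such run with a single space. Note that \w also matches the underscore and, for Python 3 str patterns, non-ASCII letters and digits.

  2. .lower() converts everything to lowercase.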
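To make the two steps concrete, here is a minimal sketch that applies them one at a time. The `sample` string is a made-up input chosen for illustration, not one of the sentences from the question.

```python
import re

# Hypothetical input, chosen only to illustrate the two steps.
sample = "Won't  `or(what)"

# Step 1: every run of non-word characters collapses into one space.
step1 = re.sub(r"[^\w]+", " ", sample)

# Step 2: lowercase the result.
step2 = step1.lower()

print(repr(step1))  # 'Won t or what '
print(repr(step2))  # 'won t or what '
```

The trailing `)` becomes a trailing space, because the substitution replaces unwanted characters wherever they occur, including at the ends of the string.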


sentence_1 = 'Doesn\'t get, how{to}% \\operate+66.7 :after[it]"" & lt;# & gt; won\'t `or(what)'
sentence_2 = 'O\]k,.lar7i$double{} check wif*& da! hair: [dresser;   ..already He SaID-77.88.5 wun cut v short question(std txt rate)T&C\'s'

print(clean_data(sentence_1))
print(clean_data(sentence_2))

>>> doesn t get how to operate 66 7 after it lt gt won t or what 
>>> o k lar7i double check wif da hair dresser already he said 77 88 5 wun cut v short question std txt rate t c s
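Note that the first output line above ends with a space, and that `\w` would let underscores and non-ASCII letters through. If the requirement is strictly a-z and 0-9 with no leading or trailing space, one possible variant is sketched below. The name `clean_data_strict` is hypothetical and is not part of the original answer.

```python
import re

def clean_data_strict(message: str) -> str:
    """Keep only a-z and 0-9, separated by single spaces (illustrative variant)."""
    # Lowercase first so the character class only needs a-z.
    lowered = message.lower()
    # Replace each run of disallowed characters (including underscores) with one space.
    spaced = re.sub(r"[^a-z0-9]+", " ", lowered)
    # Drop the space left behind by punctuation at either end.
    return spaced.strip()

print(clean_data_strict("  Hello,   World_42!  "))  # hello world 42
```

With this variant the underscore is treated like any other separator, so `World_42` becomes `world 42`, whereas the `\w` version would keep it as `world_42`.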

