How to remove all emoji (Unicode) characters from a string in Python

Question

I have the following string:

tweet = "Get $10 worth of AMAL!!\\nThis campaign will be final AirDrop before official release!!\\nhttps://form.run/@airdrop-e\xa0\\n\\nRT please!\\n\\n#amanpuri #AMAL\\n#BTC #XRP #ETH \\n#cryptocurrency  \\n#China #bitcoin \\n#\\xe3\\x82\\xa2\\xe3\\x83\\x9e\\xe3\\x83\\xb3\\xe3\\x83\\x97\\xe3\\x83\\xaa"

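I need to clean it, but I am stuck on removing the symbols at the end of the string, i.e. \\n#\\xe3\\x82\\xa2\\xe3. These are most likely Unicode symbols, emoji, and the newline symbol \\n. Here is what I have done: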

pat1 = r'@[A-Za-z0-9]+' # this is to remove any text with @ (links)
pat2 = r'https?://[A-Za-z0-9./]+'  # this is to remove the urls
pat3 = r'[^a-zA-Z0-9$]' # to remove every other character except a-z & 0-9 & $
combined_pat2 = r'|'.join((r'|'.join((pat1, pat2)),pat3)) # combine pat1, pat2 and pat3 to pass it in the cleaning steps

I get the following output:

get $10 worth of amal   nthis campaign will be final airdrop before official release   n   e  n nrt please  n n amanpuri  amal n btc  xrp  eth  n cryptocurrency   n china  bitcoin  n  xe3 x82 xa2 xe3 x83 x9e xe3 x83 xb3 xe3 x83 x97 xe3 x83 xaa

So I still have all these n and xe3 fragments. Can anyone suggest a Python regular expression for this purpose? Thanks in advance.
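The question does not show how the patterns were applied. A minimal sketch that reproduces this kind of output, assuming the combined pattern was passed to re.sub with a single space as the replacement and the result was lowercased (both are assumptions, and sample is a shortened stand-in for the tweet):

```python
import re

pat1 = r'@[A-Za-z0-9]+'            # @mentions
pat2 = r'https?://[A-Za-z0-9./]+'  # URLs
pat3 = r'[^a-zA-Z0-9$]'            # anything except a-z, A-Z, 0-9 and $
combined_pat2 = r'|'.join((r'|'.join((pat1, pat2)), pat3))

# Shortened stand-in for the tweet; the backslashes are literal characters.
sample = "RT please!\\n#\\xe3\\x82"

# Assumed application step (not shown in the question).
cleaned = re.sub(combined_pat2, ' ', sample).lower()
print(cleaned)
```

Under these assumptions pat3 replaces each backslash with a space but keeps the letters and digits that follow it, which would explain the leftover n and xe3 fragments.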

Answer 1

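Those are not characters. They are escape sequences stored as literal text (a backslash followed by n, or by x and two more characters). You can match them with this regular expression: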

r'\\(n|x..)'
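To see what the pattern picks up, here is a quick check on a shortened sample. The sample string is made up for illustration, and the group is written as non-capturing (?:...) only so that re.findall returns the whole match instead of just the group:

```python
import re

# Hypothetical shortened sample; the backslashes are literal characters,
# as in the tweet from the question.
sample = "RT please!\\n#BTC \\n#\\xe3\\x82\\xa2"

# Same pattern as above, with a non-capturing group for findall.
escape_pat = re.compile(r'\\(?:n|x..)')

matches = escape_pat.findall(sample)
print(matches)
```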

If you want to remove them, use:

import re
tweet = re.sub(r'\\(n|x..)', '', tweet)
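Putting it together, one possible cleaning routine is sketched below. The function name clean_tweet, the order of the steps, the use of a space as the replacement, the hex-only variant of the escape pattern, and the \S+ URL pattern are all assumptions made for illustration, not part of the original answer:

```python
import re

def clean_tweet(text):
    """Hypothetical helper: strip escapes, URLs, mentions and symbols."""
    # 1. Replace literal escape sequences such as \n or \xe3 with a space.
    #    Assumption: x must be followed by exactly two hex digits.
    text = re.sub(r'\\(?:n|x[0-9a-fA-F]{2})', ' ', text)
    # 2. Replace URLs and @mentions with a space.
    text = re.sub(r'https?://\S+|@[A-Za-z0-9]+', ' ', text)
    # 3. Replace everything except letters, digits and $ with a space.
    text = re.sub(r'[^a-zA-Z0-9$]', ' ', text)
    # 4. Collapse runs of whitespace and lowercase the result.
    return ' '.join(text.split()).lower()

# Shortened stand-in for the tweet in the question.
sample = ("Get $10 of AMAL!!\\nhttps://form.run/@airdrop-e\xa0\\n\\n"
          "RT please!\\n#BTC \\n#\\xe3\\x82\\xa2")
result = clean_tweet(sample)
print(result)
```

Removing the escape sequences before the catch-all step matters: once the backslash is gone, the remaining n or xe3 looks like ordinary text and can no longer be told apart from real words.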

Answer 1
1 accepted 2019-12-04 01:53:45
