How do I remove special characters and Unicode emojis in PySpark?

Question

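Good afternoon everyone. I'm having trouble cleaning special characters out of a string column in a DataFrame. I only want to remove special characters such as HTML components, emojis, and Unicode errors such as \– .

Does anyone have a regular expression that would help, or any suggestions on how to approach this?

Input: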

i want to remove 😃 and codes "\u2022"

Expected output:

i want to remove and codes

What I tried:

re.sub('[^A-Za-z0-9 \u2022]+', '', nome)

regexp_replace('nome', '\r\n|/[\x00-\x1F\x7F]/u', ' ')

Answer 1

import re
L=re.findall(r"[^😃•]+", "abdasfrasdfadfs😃adfaa😃•sdf•adsfasfasfasf")
print(L) # prints ['abdasfrasdfadfs', 'adfaa', 'sdf', 'adsfasfasfasf']

So, to remove the smiley emoji and the bullet character (•), you can apply the pattern above, call findall, and then join the returned list, like this:

import re
given_string = "😃Your input• string 😃"
result_string = "".join(re.findall(r"[^😃•]+", given_string))
print(result_string) #prints 'Your input string '
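The same result can be reached more directly with re.sub, by matching the unwanted characters themselves instead of collecting everything else. A minimal sketch, reusing the sample string from the answer above:

```python
import re

given_string = "😃Your input• string 😃"

# Match runs of the unwanted characters and replace them with nothing.
result_string = re.sub(r"[😃•]+", "", given_string)
print(result_string)  # prints 'Your input string '
```

This avoids building an intermediate list, and the pattern lists what is removed rather than what is kept.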

If you know a character's Unicode code point, you can write it in the pattern as an escape instead of the literal character, as follows:

result_string = "".join(re.findall(r"[^😃\u2022]+", given_string))
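Code point escapes also make it possible to cover whole ranges rather than listing characters one by one. The sketch below is only an illustration: the ranges are an assumption chosen to cover common emoji and symbol blocks, they are not an exhaustive definition of "emoji", and strip_symbols is a hypothetical helper name.

```python
import re

# Assumed, non-exhaustive ranges; extend them to match your own data.
SYMBOLS = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and related blocks
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u2022"                 # bullet
    "]+"
)

def strip_symbols(text: str) -> str:
    """Remove every character that falls inside the ranges above."""
    return SYMBOLS.sub("", text)

print(strip_symbols("😃Your input• string 😃"))  # prints 'Your input string '
```

Unlike a blanket non-ASCII filter, a range-based pattern leaves accented letters and other non-English text untouched.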

Answer 2

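You can use this regular expression with the regexp_replace function to remove all non-ASCII characters from the column. Then remove the empty pair of double quotes that may be left behind: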

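import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session (e.g. in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('i want to remove 😃 and codes "\u2022"',)], ["value"])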

df = df.withColumn(
    "value_2",
    # Raw string, so the \x escapes reach the regex engine instead of being expanded by Python.
    F.regexp_replace(F.regexp_replace("value", r"[^\x00-\x7F]+", ""), '""', '')
)

df.show(truncate=False)

#+---------------------------------+----------------------------+
#|value                            |value_2                     |
#+---------------------------------+----------------------------+
#|i want to remove 😃 and codes "•"|i want to remove  and codes |
#+---------------------------------+----------------------------+
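Note that value_2 above still has a double space and a trailing space, whereas the expected output in the question has neither. One possible follow-up, sketched here under assumptions, is to collapse whitespace runs and trim the result. The helper names clean_text and clean_column are hypothetical; clean_text is a pure-Python mirror of the column expression, included so the patterns can be checked without a Spark session.

```python
import re

NON_ASCII = r"[^\x00-\x7F]+"  # anything outside 7-bit ASCII
EMPTY_QUOTES = '""'           # quotes left empty once their content is removed
WHITESPACE_RUN = r"\s+"       # one or more whitespace characters

def clean_text(text: str) -> str:
    """Pure-Python mirror of clean_column, for checking the patterns."""
    text = re.sub(NON_ASCII, "", text)
    text = text.replace(EMPTY_QUOTES, "")
    return re.sub(WHITESPACE_RUN, " ", text).strip()

def clean_column(df, name: str, out: str):
    """Apply the same three steps to a Spark DataFrame column."""
    # Imported here so clean_text can be tried without a Spark installation.
    import pyspark.sql.functions as F

    no_unicode = F.regexp_replace(F.col(name), NON_ASCII, "")
    no_quotes = F.regexp_replace(no_unicode, EMPTY_QUOTES, "")
    collapsed = F.regexp_replace(no_quotes, WHITESPACE_RUN, " ")
    return df.withColumn(out, F.trim(collapsed))

print(clean_text('i want to remove 😃 and codes "\u2022"'))
# prints 'i want to remove and codes'
```

With the DataFrame from the answer, this would be called as clean_column(df, "value", "value_2"). Keep in mind that the non-ASCII filter also strips accented letters, so it only suits data that is expected to be plain English.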

Answer 3

None of the solutions above worked. Does anyone have any other suggestions?

2 solutions

Solution 1
0 2021-11-05 23:24:07

Solution 2
0 2021-11-06 11:40:08

Solution 3
-2 2021-11-06 16:28:38
