How to remove extra escape characters from a text column in a Spark DataFrame

Question

My data in JSON looks like this:

{"text": "\"I have recently taken out a 12 month mobile phone contract with Virgin but despite two calls to customer help I still am getting a message on my phone indicating \\\"No Service\\\" although intermittently I do get connected.\"", "created_at": "\"2018-08-27 16:58:30\"", "service_id": "51870", "category_id": "249"}

I read this JSON using:

val complaintsSourceRaw = spark.read.json("file:///complaints.jsonl")

When I look at the data in the DataFrame, it looks like this:

|249        |"2018-08-27 16:58:30"|51870     |"I have recently taken out a 12 month mobile phone contract with Virgin but despite two calls to customer help I still am getting a message on my phone indicating **\"No Service\"** although intermittently I do get connected."

The problem is that

**\"No Service\"** needs to become **"No Service"**

Here is what I tried:

complaintsSourceRaw.withColumn("text_cleaned", functions.regexp_replace(complaintsSourceRaw.col("text"), "\", ""));

But the \ character escapes my " and the code breaks. Any idea how to achieve this?

Answer 1

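You need to escape the "\" character. The pattern passed to regexp_replace is a regular expression, and a literal backslash in a regular expression is written as two backslashes (\\), not one. In an ordinary Scala string literal each of those backslashes must itself be escaped, so the pattern argument becomes "\\\\".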
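
As a minimal sketch of that fix, the snippet below builds a one-row DataFrame by hand instead of reading complaints.jsonl. The object name, the local master setting, and the shortened sample sentence are assumptions made for illustration; the column names text and text_cleaned come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

object RemoveEscapes {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the example out.
    val spark = SparkSession.builder()
      .appName("remove-escapes")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the "text" column. After Scala unescapes the
    // literal, the stored value contains the two-character sequence \" around
    // No Service, just like the parsed JSON in the question.
    val complaints = Seq(
      "\"I still am getting a message indicating \\\"No Service\\\" although intermittently I do get connected.\""
    ).toDF("text")

    // The regex \\ matches one literal backslash. Written as a Scala string
    // literal, that regex is "\\\\" (four backslashes in the source).
    val cleaned = complaints.withColumn(
      "text_cleaned",
      regexp_replace(col("text"), "\\\\", "")
    )

    cleaned.show(truncate = false)
    spark.stop()
  }
}
```

A triple-quoted literal such as """\\""" should work as the pattern too, because backslashes are not treated as escapes inside triple quotes. Note that this removes every backslash in the column, not only the ones that precede a quote.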

Solution 1
0 2022-01-25 12:07:03