How to remove extra escape characters from a text column in a Spark DataFrame
My data in a JSON file looks like:
{"text": "\"I have recently taken out a 12 month mobile phone contract with Virgin but despite two calls to customer help I still am getting a message on my phone indicating \\\"No Service\\\" although intermittently I do get connected.\"", "created_at": "\"2018-08-27 16:58:30\"", "service_id": "51870", "category_id": "249"}
I read this JSON using:
val complaintsSourceRaw = spark.read.json("file:///complaints.jsonl")
When I read the data into a DataFrame, it looks like:
|249 |"2018-08-27 16:58:30"|51870 |"I have recently taken out a 12 month mobile phone contract with Virgin but despite two calls to customer help I still am getting a message on my phone indicating **\"No Service\"** although intermittently I do get connected."
The issue is:
**\"No Service\"** needs to be **"No Service"**
What I am trying:
complaintsSourceRaw.withColumn("text_cleaned", functions.regexp_replace(complaintsSourceRaw.col("text"), "\", ""));
However, the \ character escapes my " and the code breaks. Any idea how to achieve this?
You need to escape the \ character twice: once for the regular expression (a literal backslash is matched by the two-character regex \\) and once more for the Scala/Java string literal, so the pattern you pass to regexp_replace should be written as "\\\\", not "\".
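A minimal sketch of that escaping rule using plain Java regex semantics (which Spark's `regexp_replace` follows under the hood); the sample string is taken from the question's data:

```scala
object BackslashDemo {
  def main(args: Array[String]): Unit = {
    // The raw value contains literal \" sequences, as in the question's "text" column.
    val raw = "indicating \\\"No Service\\\" although"

    // "\\\\" in a Scala string literal is the two-character regex \\ ,
    // which matches exactly one literal backslash.
    val cleaned = raw.replaceAll("\\\\", "")

    println(cleaned) // prints: indicating "No Service" although
  }
}
```

Applied to the DataFrame from the question, the same pattern would look like `complaintsSourceRaw.withColumn("text_cleaned", functions.regexp_replace(complaintsSourceRaw.col("text"), "\\\\", ""))`.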