Replace single quotes with double quotes in a PySpark DataFrame

Question

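I am writing a DataFrame to a CSV file with the code below.

Because my DataFrame contains "" where the value is None, I added replace("", None), since null values should be represented as None rather than "" (double quotes).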

newDf.coalesce(1).replace("", None).replace("'", "\"").write.format('csv').option('nullValue', None).option('header', 'true').option('delimiter', '|').mode('overwrite').save(destination_csv)

I tried adding .replace("'", "\""), but it doesn't work.

The data also contains values with single quotes inside them.

For example:

Survey No. 123, 'Anjanadhri Godowns', CityName

I need to replace the single quotes in the DataFrame with double quotes.

How can I do this?

Answer 1

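DataFrame.replace only matches whole cell values, so it cannot change a quote that sits inside a longer string. Before writing the output, you can use regexp_replace to replace single quotes with double quotes in every column: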

import pyspark.sql.functions as F

df2 = df.select([F.regexp_replace(c, "'", '"').alias(c) for c in df.columns])

# then write output
# df2.coalesce(1).write(...)

Answer 2

Use translate:

from pyspark.sql.functions import translate

data_list = [(1, "'Name 1'"), (2, "'Name 2' and 'Something'")]
df = spark.createDataFrame(data = data_list, schema = ["ID", "my_col"])
# +---+--------------------+
# | ID|              my_col|
# +---+--------------------+
# |  1|            'Name 1'|
# |  2|'Name 2' and 'Som...|
# +---+--------------------+

df.withColumn('my_col', translate('my_col', "'", '"')).show()
# +---+--------------------+
# | ID|              my_col|
# +---+--------------------+
# |  1|            "Name 1"|
# |  2|"Name 2" and "Som...|
# +---+--------------------+

This replaces every single-quote character in the my_col column with a double quote.
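Spark's translate maps characters one for one instead of matching a pattern. As a rough plain-Python analogy (this uses Python's built-in str.translate, not the Spark function), the same mapping applied to the sample address from the question looks like this:

```python
# Plain-Python analogy only: str.translate is not the Spark function,
# but it applies the same kind of character-for-character mapping.
sample = "Survey No. 123, 'Anjanadhri Godowns', CityName"

# Map the single-quote character to the double-quote character.
quote_map = str.maketrans("'", '"')

converted = sample.translate(quote_map)
print(converted)  # Survey No. 123, "Anjanadhri Godowns", CityName
```

Because only a single character is swapped here, translate and regexp_replace give the same result; translate simply avoids regular-expression handling.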
