
pyspark can't stop reading empty string as null (spark 3.0)

I have a csv data file like this (^ as delimiter):

ID name age
0
1 Mike 20

When I do

df = spark.read.option("delimiter", "^").option("quote","").option("header", "true").option(
        "inferSchema", "true").csv(xxxxxxx)

Spark defaults the two columns after row 0 to null:

df.show():

ID name age
0 null null
1 Mike 20

How can I stop pyspark from reading the data as null, and instead keep it as an empty string?

I have tried adding some options at the end:

1. option("nullValue", "xxxx").option("treatEmptyValuesAsNulls", False)
2. option("nullValue", None).option("treatEmptyValuesAsNulls", False)
3. option("nullValue", None).option("emptyValue", None)
4. option("nullValue", "xxx").option("emptyValue", "xxx")

But no matter what I do, pyspark still reads the data as null. Is there a way to make pyspark read the empty string as it is?

Thanks

It looks like empty values have been treated as null since Spark version 2.0.1. One way to achieve your result is using df.na.fill(...):

df = spark.read.csv('your_data_path', sep='^', header=True)

df.printSchema()
# root
#  |-- ID: string (nullable = true)
#  |-- name: string (nullable = true)
#  |-- age: string (nullable = true)

# Fill all columns
# df = df.na.fill('')

# Fill specific columns
df = df.na.fill('', subset=['name', 'age'])

df.show(truncate=False)

Output

+---+----+---+
|ID |name|age|
+---+----+---+
|0  |    |   |
|1  |Mike|20 |
+---+----+---+
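If it helps to see what na.fill is doing, its behavior can be sketched in plain Python without a Spark session. The fill_na helper and the inline rows below are illustrative only, not part of the PySpark API; rows mimics the parsed CSV, where unquoted empty fields arrive as None (null):

```python
# Mimic the parsed CSV: unquoted empty fields come back as None (null).
rows = [
    {"ID": "0", "name": None, "age": None},
    {"ID": "1", "name": "Mike", "age": "20"},
]

def fill_na(rows, value, subset):
    """Replace None with `value`, but only in the columns listed in `subset`
    (a plain-Python stand-in for DataFrame.na.fill(value, subset=...))."""
    return [
        {col: (value if val is None and col in subset else val)
         for col, val in row.items()}
        for row in rows
    ]

filled = fill_na(rows, "", subset=["name", "age"])
print(filled)
# [{'ID': '0', 'name': '', 'age': ''}, {'ID': '1', 'name': 'Mike', 'age': '20'}]
```

Note that na.fill only replaces nulls in columns whose type matches the fill value, which is why the string '' works here: with inferSchema left off, all three columns are read as strings.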
