
pyspark can't stop reading empty string as null (spark 3.0)

I have a csv data file like this (^ as delimiter):

ID name age
0
1 Mike 20

When I do

df = spark.read.option("delimiter", "^").option("quote","").option("header", "true").option(
        "inferSchema", "true").csv(xxxxxxx)

Spark defaults the two columns after row 0 to null:

df.show():

ID name age
0 null null
1 Mike 20

How can I stop pyspark from reading the data as null, and instead keep it as an empty string?

I have tried adding some options at the end:

1. option("nullValue", "xxxx").option("treatEmptyValuesAsNulls", False)
2. option("nullValue", None).option("treatEmptyValuesAsNulls", False)
3. option("nullValue", None).option("emptyValue", None)
4. option("nullValue", "xxx").option("emptyValue", "xxx")

But no matter what I do, pyspark still reads the data as null. Is there a way to make pyspark read the empty string as it is?

Thanks

It looks like empty values have been treated as null since Spark version 2.0.1. One way to achieve your result is using df.na.fill(...):

df = spark.read.csv('your_data_path', sep='^', header=True)

df.printSchema()
# root
#  |-- ID: string (nullable = true)
#  |-- name: string (nullable = true)
#  |-- age: string (nullable = true)

# Fill all columns
# df = df.na.fill('')

# Fill specific columns
df = df.na.fill('', subset=['name', 'age'])

df.show(truncate=False)

Output

+---+----+---+
|ID |name|age|
+---+----+---+
|0  |    |   |
|1  |Mike|20 |
+---+----+---+
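If it helps to see what na.fill is doing, its behavior can be sketched in plain Python without a Spark session. The fill_na helper and the inline rows below are illustrative only, not part of the PySpark API; rows mimics the parsed CSV, where unquoted empty fields arrive as None (null):

```python
# Mimic the parsed CSV: unquoted empty fields come back as None (null).
rows = [
    {"ID": "0", "name": None, "age": None},
    {"ID": "1", "name": "Mike", "age": "20"},
]

def fill_na(rows, value, subset):
    """Replace None with `value`, but only in the columns listed in `subset`
    (a plain-Python stand-in for DataFrame.na.fill(value, subset=...))."""
    return [
        {col: (value if val is None and col in subset else val)
         for col, val in row.items()}
        for row in rows
    ]

filled = fill_na(rows, "", subset=["name", "age"])
print(filled)
# [{'ID': '0', 'name': '', 'age': ''}, {'ID': '1', 'name': 'Mike', 'age': '20'}]
```

Note that na.fill only replaces nulls in columns whose type matches the fill value, which is why the string '' works here: with inferSchema left off, all three columns are read as strings.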
