pyspark can't stop reading empty string as null (spark 3.0)
I have a csv data file like this (`^` as delimiter):
| ID | name | age |
|----|------|-----|
| 0  |      |     |
| 1  | Mike | 20  |
When I do
df = spark.read.option("delimiter", "^").option("quote","").option("header", "true").option(
"inferSchema", "true").csv(xxxxxxx)
Spark defaults the name and age columns of row 0 to null:
df.show():
| ID | name | age  |
|----|------|------|
| 0  | null | null |
| 1  | Mike | 20   |
How can I stop pyspark from reading the data as null and read it as an empty string instead?
I have tried adding some options at the end:
1. `option("nullValue", "xxxx").option("treatEmptyValuesAsNulls", False)`
2. `option("nullValue", None).option("treatEmptyValuesAsNulls", False)`
3. `option("nullValue", None).option("emptyValue", None)`
4. `option("nullValue", "xxx").option("emptyValue", "xxx")`
But no matter what I do, pyspark still reads the data as null. Is there a way to make pyspark read the empty string as-is?
Thanks
It looks like empty values have been treated as null since Spark version 2.0.1. One way to achieve your result is to use `df.na.fill(...)`:
df = spark.read.csv('your_data_path', sep='^', header=True)
df.printSchema()
# root
#  |-- ID: string (nullable = true)
#  |-- name: string (nullable = true)
#  |-- age: string (nullable = true)

# Fill all columns:
# df = df.na.fill('')
# Fill only specific columns:
df = df.na.fill('', subset=['name', 'age'])
df.show(truncate=False)
Output:
+---+----+---+
|ID |name|age|
+---+----+---+
|0 | | |
|1 |Mike|20 |
+---+----+---+