
Write a Spark Dataset to JSON with all keys in the schema, including null columns

I am writing a Dataset to JSON using:

ds.coalesce(1).write.format("json").option("nullValue",null).save("project/src/test/resources")

For records that have columns with null values, the JSON document omits that key entirely.

Is there a way to force keys with null values to appear in the JSON output?

This is needed because I read this JSON into another Dataset in a test case, and I cannot enforce a schema if some documents lack keys that are in the case class. (I read it by putting the JSON file under the resources folder and transforming it to a Dataset via RDD[String], as explained here: https://databaseline.bitbucket.io/a-quickie-on-reading-json-resource-files-in-apache-spark/ )

I agree with @philantrovert.

ds.na.fill("")
  .coalesce(1)
  .write
  .format("json")
  .save("project/src/test/resources")

Since Datasets are immutable, you are not altering the data in ds, and you can still process it (null values and all) in any following code. You are simply replacing null values with an empty string in the saved file. Note that na.fill("") only fills null values in string columns; null values in columns of other types are left as they are.

Since PySpark 3, you can use the ignoreNullFields option when writing to a JSON file.

spark_dataframe.write.json(output_path, ignoreNullFields=False)

PySpark docs: https://spark.apache.org/docs/3.1.1/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.json
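
For older Spark versions, where ignoreNullFields may not be available, one possible workaround (not taken from the answers above) is to post-process the written JSON lines file and add every missing key back with a null value. The sketch below uses only the Python standard library; the helper name fill_missing_keys, the sample records, and the key list are hypothetical, and it assumes flat records with one JSON object per line (nested structs are not handled).

```python
import json


def fill_missing_keys(json_lines, expected_keys):
    """Return JSON lines in which every record carries all expected keys.

    Keys absent from a record are added with a null value; keys that are
    already present keep their original value.
    """
    completed = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Start with every expected key set to None, then overlay real values.
        full = {key: None for key in expected_keys}
        full.update(record)
        completed.append(json.dumps(full))
    return completed


# Hypothetical sample: the second record lost its "name" key on write.
raw_lines = ['{"id": 1, "name": "a"}', '{"id": 2}']
fixed_lines = fill_missing_keys(raw_lines, ["id", "name"])
for fixed in fixed_lines:
    print(fixed)
```

In a test setup, the list of expected keys would come from the fields of the case class, and the completed lines would be written back to the resource file before it is read into a Dataset.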
