
Handling a JSON file in Spark

In Spark-Scala, I need to create a data frame from a JSON file with a nested structure.

Issue scenario:

I have a JSON input with a complex nested structure. On any given day, some of the keys may not be present on any of the records (the keys are optional): a key that is missing on day 1 may appear on day 2. Even so, I expect a generic output containing all of the expected columns despite the missing keys. I cannot simply use the withColumn function to apply a default value, because if a key is present on a given day its actual value should be used. If I do a select, it fails with an "unable to resolve" error, since the key may not be present at all on some days. Please advise me of any solution.

This is a very common problem in data ingestion. Most data requires schema evolution, i.e. the schema changes over time.

There are essentially two options.

  1. Pass the schema while reading the dataframe: this works well when you know the superset of all the schemas. Spark will set the columns missing from one day's data to NULL (see the sketch after this list).

  2. Evolve the schema using Spark schema merging: Spark does schema merging by default. You can union the existing snapshot with the incoming delta and read it back as JSON:

     val df1 = spark.read.json("/path/snapshot")
     val df2 = spark.read.json("/path/delta")
     spark.read.json(df1.toJSON.union(df2.toJSON))
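
For option 1, here is a minimal sketch of reading JSON with an explicit superset schema. The field names ("id", "name", "address", ...) and the file path are assumptions for illustration only; adapt them to your actual data:

     import org.apache.spark.sql.SparkSession
     import org.apache.spark.sql.types._

     val spark = SparkSession.builder().appName("json-schema").getOrCreate()

     // Superset schema: declare every key you ever expect, even the optional ones.
     // Field names here are purely illustrative.
     val schema = StructType(Seq(
       StructField("id", StringType, nullable = true),
       StructField("name", StringType, nullable = true),
       StructField("address", StructType(Seq(   // nested, optional block
         StructField("city", StringType, nullable = true),
         StructField("zip", StringType, nullable = true)
       )), nullable = true)
     ))

     // Keys absent from a given day's file come back as NULL,
     // so selecting them no longer fails with "unable to resolve".
     val df = spark.read.schema(schema).json("/path/day1")
     df.select("id", "name", "address.city").show()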
