How to add missing keys with null/blank values in nested JSON that contains a nested list of dicts, using Python/PySpark
I have JSON as below:
{"id": 1, "type": "int", "data": {"key0": "val1", "key2": "val2"}}
{"id": 2, "type": "int", "data": {"key2": "val3", "key3": "val4"}}
{"id": 3, "type": "int", "data": {"key1": "val5", "key3": "val6"}}
When flattening with PySpark, I need all of key0, key1, key2, and key3 as columns. But when I select data.key3, or any other key that is missing from some records, the job fails with the error "pyspark.sql.utils.AnalysisException: 'No such struct field". I tried passing a schema, but the issue persisted, and I tried the withColumn approach with when, but that fails as well. Has anyone faced a similar issue and fixed it? Kindly help.
Below is how I am reading the file and printing the schema:
df_landing = spark.read.format("json").option("multiline", "true").load(input_file)
df_landing.printSchema()  # printSchema() prints by itself and returns None
Below is the result:
root
 |-- data: struct (nullable = true)
 |    |-- key0: string (nullable = true)
 |    |-- key2: string (nullable = true)
 |-- id: long (nullable = true)
 |-- type: string (nullable = true)
You should remove .option("multiline", "true"). That option is meant for the case where a single JSON record spans multiple lines.
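Your data is JSONL: each line is a valid JSON document, and no record spans multiple lines. With multiline enabled, Spark appears to treat the file as a single JSON document and reads only the first record, which would explain why the inferred schema above contains only key0 and key2.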
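The difference can be illustrated outside Spark with Python's standard json module. This is only an analogy and not Spark's actual reader: parsing line by line yields all three records, while treating the text as one document stops after the first object. The sample text is copied from the question; the variable names are made up for this sketch.

```python
import json

# The three records from the question, one JSON object per line (JSONL).
jsonl_text = """\
{"id": 1, "type": "int", "data": {"key0": "val1", "key2": "val2"}}
{"id": 2, "type": "int", "data": {"key2": "val3", "key3": "val4"}}
{"id": 3, "type": "int", "data": {"key1": "val5", "key3": "val6"}}
"""

# JSONL-style parsing: every non-empty line is its own document.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# Single-document parsing: raw_decode reads one JSON value from the start
# of the text and reports where it stopped; the rest is left unparsed.
first_only, end = json.JSONDecoder().raw_decode(jsonl_text)

# Keys seen under "data" in each case.
keys_per_line = sorted({k for r in records for k in r["data"]})
keys_first_only = sorted(first_only["data"])

print(keys_per_line)    # ['key0', 'key1', 'key2', 'key3']
print(keys_first_only)  # ['key0', 'key2']
```

The second list matches the two fields in the schema printed in the question, which is consistent with only the first record having been read.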
If you remove the option, you should see a schema like this:
root
 |-- data: struct (nullable = true)
 |    |-- key0: string (nullable = true)
 |    |-- key1: string (nullable = true)
 |    |-- key2: string (nullable = true)
 |    |-- key3: string (nullable = true)
 |-- id: long (nullable = true)
 |-- type: string (nullable = true)
Then, you can use this code to expand the struct:
df_landing = df_landing.select('id', 'type', 'data.*')
# df_landing.show()
+---+----+----+----+----+----+
| id|type|key0|key1|key2|key3|
+---+----+----+----+----+----+
|  1| int|val1|null|val2|null|
|  2| int|null|null|val3|val4|
|  3| int|null|val5|null|val6|
+---+----+----+----+----+----+
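For small inputs, the same flattening can be sketched in plain Python, which the question title also mentions. This is an illustrative sketch and not part of the original answer; the names `records`, `all_keys`, and `flat_rows` are hypothetical. It collects the union of keys found under `data` and fills the absent ones with `None`.

```python
import json

# The three records from the question, one JSON object per line (JSONL).
jsonl_text = """\
{"id": 1, "type": "int", "data": {"key0": "val1", "key2": "val2"}}
{"id": 2, "type": "int", "data": {"key2": "val3", "key3": "val4"}}
{"id": 3, "type": "int", "data": {"key1": "val5", "key3": "val6"}}
"""

records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# Union of every key that appears under "data" in any record.
all_keys = sorted({key for rec in records for key in rec["data"]})

# Flatten each record; dict.get returns None for keys the record lacks.
flat_rows = [
    {
        "id": rec["id"],
        "type": rec["type"],
        **{key: rec["data"].get(key) for key in all_keys},
    }
    for rec in records
]

for row in flat_rows:
    print(row)
```

Each printed row has the same six keys, with `None` playing the role of the null cells in the table above.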