How to add missing keys with null/blank values in nested JSON that has a nested list of dicts, using Python/PySpark

I have JSON data as below:

{"id": 1, "type": "int", "data": {"key0": "val1", "key2": "val2"}}
{"id": 2, "type": "int", "data": {"key2": "val3", "key3": "val4"}}
{"id": 3, "type": "int", "data": {"key1": "val5", "key3": "val6"}}

When flattening this with PySpark, I need all of key0, key1, key2 and key3 as columns. But when I select data.key3, or any other key that is not present in every record, the job fails with the error "pyspark.sql.utils.AnalysisException: 'No such struct field". I tried passing a schema, but the issue persisted. I also tried the withColumn approach with when, and that fails too. Has anyone faced a similar issue and fixed it? Any help is appreciated.

Below is how I read the file and print the schema:

df_landing = spark.read.format("json").option("multiline", "true").load(input_file)
df_landing.printSchema()  # printSchema() prints the schema itself and returns None

Below is the result:

root
 |-- data: struct (nullable = true)
 |    |-- key0: string (nullable = true)
 |    |-- key2: string (nullable = true)
 |-- id: long (nullable = true)
 |-- type: string (nullable = true)

You should remove .option("multiline", "true"). That option is meant for input where a single JSON record spans multiple lines.

Your data is JSONL: each line is a valid JSON document, and no record spans multiple lines.
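
To illustrate why the schema in the question shows only key0 and key2, here is a small plain-Python analogy (it does not reproduce Spark's internals). It assumes that a whole-file parse stops after the first JSON value, which is consistent with the schema the question reports, and compares that with parsing each line as its own record. The variable names are made up for the illustration.

```python
import json

# The three records from the question, one JSON object per line (JSONL).
jsonl_text = "\n".join([
    '{"id": 1, "type": "int", "data": {"key0": "val1", "key2": "val2"}}',
    '{"id": 2, "type": "int", "data": {"key2": "val3", "key3": "val4"}}',
    '{"id": 3, "type": "int", "data": {"key1": "val5", "key3": "val6"}}',
])

# Whole-file parse: decode a single JSON value from the start, then stop.
# Only the first record is seen, so only its nested keys are known.
first_doc, _end = json.JSONDecoder().raw_decode(jsonl_text)
keys_whole_file = sorted(first_doc["data"])

# Line-by-line parse: every non-empty line is its own record,
# so the union of nested keys covers all records.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
keys_per_line = sorted({key for rec in records for key in rec["data"]})

print(keys_whole_file)  # ['key0', 'key2']
print(keys_per_line)    # ['key0', 'key1', 'key2', 'key3']
```

The second list matches the four struct fields Spark is expected to infer once the file is read line by line.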

If you remove the option, you should see a schema like this:

root
 |-- data: struct (nullable = true)
 |    |-- key0: string (nullable = true)
 |    |-- key1: string (nullable = true)
 |    |-- key2: string (nullable = true)
 |    |-- key3: string (nullable = true)
 |-- id: long (nullable = true)
 |-- type: string (nullable = true)

Then you can use this code to expand the struct:

df_landing = df_landing.select('id', 'type', 'data.*')
# df_landing.show()
+---+----+----+----+----+----+
| id|type|key0|key1|key2|key3|
+---+----+----+----+----+----+
|  1| int|val1|null|val2|null|
|  2| int|null|null|val3|val4|
|  3| int|null|val5|null|val6|
+---+----+----+----+----+----+
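
For small inputs, or just to check what the expanded table should contain, the same result can be sketched in plain Python without Spark. This is a minimal sketch, not part of the answer above: the helper name flatten_with_all_keys and its nested_field parameter are hypothetical, and it assumes the nested value is a flat dict, as in the question's data.

```python
import json

def flatten_with_all_keys(lines, nested_field="data"):
    """Hypothetical helper: flatten JSONL records and fill missing nested keys with None."""
    records = [json.loads(line) for line in lines if line.strip()]
    # Union of every key that appears under the nested field in any record.
    all_keys = sorted({key for rec in records for key in (rec.get(nested_field) or {})})
    rows = []
    for rec in records:
        nested = rec.get(nested_field) or {}
        # Keep the top-level fields, dropping the nested struct itself.
        row = {k: v for k, v in rec.items() if k != nested_field}
        # Add one column per known key; absent keys become None (null).
        for key in all_keys:
            row[key] = nested.get(key)
        rows.append(row)
    return rows

lines = [
    '{"id": 1, "type": "int", "data": {"key0": "val1", "key2": "val2"}}',
    '{"id": 2, "type": "int", "data": {"key2": "val3", "key3": "val4"}}',
    '{"id": 3, "type": "int", "data": {"key1": "val5", "key3": "val6"}}',
]

rows = flatten_with_all_keys(lines)
for row in rows:
    print(row)
```

Each printed row has the columns id, type, key0, key1, key2 and key3, with None wherever the table above shows null.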
