
Using Azure Synapse PySpark, filter or flatten nested JSON objects based on the nested object's data type

I am working in Azure Synapse PySpark on flattening nested JSON data. The JSON file has objects with nested data as shown below; here cords is of type struct for the 1st and 3rd records and a plain numeric value for the 2nd record. When I print the schema using df.printSchema(), it shows the cords type as string; if I remove the 2nd JSON object, it prints a schema of type struct. I want to filter the JSON objects based on the cords data type so that I can flatten the nested cords struct data, whereas for the 2nd record no flattening is required. Can anyone help me with this?

{"dateTime":"2020-11-29T13:51:16.168659Z","cords":{"x_al":0.0191342489,"y_al":-0.1200904993}} {"dateTime":"2020-11-29T13:51:16.168659Z","cords":{"x_al":0.0191342489,"y_al":-0.1200904993}}

{"dateTime":"2020-12-29T13:51:21.457739Z","cords":51.0} {"dateTime":"2020-12-29T13:51:21.457739Z","cords":51.0}

{"dateTime":"2021-10-29T13:51:26.634289Z","cords":{"x_al":0.01600042489,"y_al":-0.1200900993}} {"dateTime":"2021-10-29T13:51:26.634289Z","cords":{"x_al":0.01600042489,"y_al":-0.1200900993}}

You can import pandas into your code and then load the data with it as below:

import pandas as pd

# flatten_json is the recursive helper from the blog linked at the end of this answer
df = pd.DataFrame([flatten_json(data)])

In the code line above, "data" is assumed to be a variable storing the JSON-structured data.
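As a side note (not in the original answer), pandas also ships a built-in json_normalize that performs a similar flattening for dict input; a minimal sketch using the first sample record:

import pandas as pd

data = {"dateTime": "2020-11-29T13:51:16.168659Z",
        "cords": {"x_al": 0.0191342489, "y_al": -0.1200904993}}

# sep="_" joins nested keys, giving columns: dateTime, cords_x_al, cords_y_al
df = pd.json_normalize(data, sep="_")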

There are also multiple flattening scenarios, depending on the shape of your JSON data (see the sketch after this list):

  • If you have just a dict, you can use flatten_json(data)
  • If you have multiple dicts like [{}, {}, {}], you can use [flatten_json(x) for x in data]
  • If you have multiple values like {1: {}, 2: {}, 3: {}}, you should use [flatten_json(data[key]) for key in data.keys()]
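To make those three cases concrete, here is a sketch with toy data; the flatten_json helper below is a minimal version of the kind of recursive helper defined in the blog referenced next, not a library import:

def flatten_json(y):
    """Recursively flatten nested dicts/lists into a single-level dict."""
    out = {}
    def _flatten(x, name=""):
        if isinstance(x, dict):
            for key in x:
                _flatten(x[key], name + key + "_")
        elif isinstance(x, list):
            for i, item in enumerate(x):
                _flatten(item, name + str(i) + "_")
        else:
            out[name[:-1]] = x
    _flatten(y)
    return out

# 1. A single dict
flatten_json({"cords": {"x_al": 0.019, "y_al": -0.12}})
# -> {'cords_x_al': 0.019, 'cords_y_al': -0.12}

# 2. Multiple dicts like [{}, {}, {}]
rows = [{"cords": {"x_al": 1.0}}, {"cords": {"x_al": 2.0}}]
flat_rows = [flatten_json(x) for x in rows]

# 3. Multiple values like {1: {}, 2: {}, 3: {}}
keyed = {1: {"x_al": 1.0}, 2: {"x_al": 2.0}}
flat_keyed = [flatten_json(keyed[key]) for key in keyed.keys()]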

For a better understanding of this in PySpark, refer to this blog; thanks to towardsdatascience for the clear explanation.
