
Using Azure Synapse PySpark to filter or flatten nested JSON objects based on the nested object's data type

I am working in Azure Synapse with PySpark on flattening nested JSON data. The JSON file contains objects with nested data, as shown below. Here cords is a struct in the 1st and 3rd records and a plain scalar value in the 2nd record. When I print the schema using df.printSchema(), it reports the cords type as string; if I remove the 2nd JSON object, it reports the type as struct. I want to filter the JSON objects based on the data type of cords so that I can flatten the nested data where cords is a struct, whereas for the 2nd record no flattening is required. Can anyone help me with this?

{"dateTime":"2020-11-29T13:51:16.168659Z","cords":{"x_al":0.0191342489,"y_al":-0.1200904993}}

{"dateTime":"2020-12-29T13:51:21.457739Z","cords":51.0}

{"dateTime":"2021-10-29T13:51:26.634289Z","cords":{"x_al":0.01600042489,"y_al":-0.1200900993}}

You can import pandas into your code and then load the data with it, as below:

import pandas as pd
df = pd.DataFrame([flatten_json(data)])

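In the line above, data is assumed to be a variable holding the parsed JSON data, and flatten_json is a helper that flattens one nested dict. It is not part of pandas, so it has to be defined or imported separately.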
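Before flattening, the records can be separated by the type of cords, which is what the question asks for. The following is a minimal sketch in plain Python rather than PySpark, assuming the file is newline-delimited JSON like the sample in the question. The names load_records and split_by_cords_type are hypothetical helpers introduced here for illustration.

```python
import json

# The three sample records from the question, one JSON object per line.
raw_lines = """\
{"dateTime":"2020-11-29T13:51:16.168659Z","cords":{"x_al":0.0191342489,"y_al":-0.1200904993}}
{"dateTime":"2020-12-29T13:51:21.457739Z","cords":51.0}
{"dateTime":"2021-10-29T13:51:26.634289Z","cords":{"x_al":0.01600042489,"y_al":-0.1200900993}}
"""


def load_records(text):
    """Parse newline-delimited JSON, skipping blank lines (hypothetical helper)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def split_by_cords_type(records):
    """Separate records whose "cords" is an object from those where it is a scalar."""
    nested, scalar = [], []
    for record in records:
        if isinstance(record.get("cords"), dict):
            nested.append(record)   # needs flattening
        else:
            scalar.append(record)   # already flat
    return nested, scalar


data = load_records(raw_lines)
nested_records, scalar_records = split_by_cords_type(data)
print(len(nested_records), len(scalar_records))
```

With the sample data, the 1st and 3rd records end up in nested_records and the 2nd in scalar_records, so only the first group needs to be flattened.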

How you call it depends on the shape of your JSON data. There are three cases:

  • If you have just a single dict, use flatten_json(data)
  • If you have a list of dicts like [{}, {}, {}], use [flatten_json(x) for x in data]
  • If you have a dict of dicts like {1: {}, 2: {}, 3: {}}, use [flatten_json(data[key]) for key in data.keys()]
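The answer does not show flatten_json itself, so here is one possible sketch of it together with the three call patterns above. This is an assumption about what such a helper could look like, not the answerer's actual code; the parent_key and sep parameters and the underscore-joined key names are choices made for illustration.

```python
def flatten_json(obj, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict (hypothetical helper).

    Nested keys are joined with `sep`, e.g. {"cords": {"x_al": 1}} -> {"cords_x_al": 1}.
    Scalar values, such as a numeric "cords", are copied through unchanged.
    """
    flat = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten_json(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Two records taken from the question: one with a nested "cords", one with a scalar.
nested = {"dateTime": "2020-11-29T13:51:16.168659Z",
          "cords": {"x_al": 0.0191342489, "y_al": -0.1200904993}}
scalar = {"dateTime": "2020-12-29T13:51:21.457739Z", "cords": 51.0}

# Case 1: a single dict.
row = flatten_json(nested)

# Case 2: a list of dicts.
rows_from_list = [flatten_json(x) for x in [nested, scalar]]

# Case 3: a dict of dicts keyed by an id.
keyed = {1: nested, 2: scalar}
rows_from_keyed = [flatten_json(keyed[key]) for key in keyed.keys()]

print(row)
```

A record whose cords is a scalar passes through unchanged, so both kinds of record can go through the same function. The resulting list of flat dicts can then be passed to pd.DataFrame(...) to build a table, with missing columns filled in as NaN.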

For a better understanding of how to do this in PySpark, refer to this blog; thanks to towardsdatascience for the clear explanation.
