I have nested JSON files with several separate branches that can only be joined through the fields at the top of each branch. I do not want to cross-join the rows from the different branches. The branches can contain lists and dictionaries, which can in turn contain further lists and dictionaries.
The following is a sample JSON file. I have 35 different files with different structures. I want to create a separate flat file for each branch, stored in its own folder. Later on, the data from these files will be processed and queried.
"Shipment": {
"ActualShipmentDate": "2020-03-22",
"EnterpriseCode": "US001",
"EventType": "CONFIRM_SHIPMENT",
"ShipmentNo": "1001816",
"Status": "1444",
"OrderDates": {
"OrderDate": [{
"ActualDate": "2019-08-01",
"DateTypeId": "PROMISE_DATE",
"OrderHeaderKey": "416734325",
"OrderLineKey": "123416734326",
"OrderReleaseKey": "",
"Extn": {
"Ext": [{
"a": 1,
"b": 2,
"c": 3
}, {
"a": 8,
"b": 9
}
]
}
}, {
"ActualDate": "2020-03-22",
"CommittedDate": "2020-03-22",
"DateTypeId": "SHIPPED_OR_CANCELLED",
"OrderHeaderKey": "416734325",
"OrderLineKey": "123416734326",
"OrderReleaseKey": " ",
"RequestedDate": "2020-03-22"
}
]
},
"ShipDates": {
"ShipDate": [{
"ActualDate": "2019-08-01",
"DateTypeId": "PROMISE_DATE",
"OrderHeaderKey": "416734325",
"OrderLineKey": "123416734326",
"Entn": {
"Ext": [{
"p": 1,
"q": 2
}, {
"p": 9
}
]
}
}, {
"ActualDate": "2020-03-22",
"CommittedDate": "2020-03-22",
"DateTypeId": "SHIPPED_OR_CANCELLED",
"OrderHeaderKey": "416734325",
"OrderLineKey": "123416734326"
}
]
}
}
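The tree structure of the above sample JSON file is shown in this image: [image: tree structure of the sample JSON]
How can I get separate structures in Python, one per branch, like in this image? [image: separate flattened structures]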
I'm trying to do this either in an AWS Lambda function or a Glue job.
Thanks a lot in advance for your help.
You can try (but rename the final columns as you need):
import pandas as pd

data = {
"Shipment": {
"ActualShipmentDate": "2020-03-22",
"EnterpriseCode": "US001",
"EventType": "CONFIRM_SHIPMENT",
"ShipmentNo": "1001816",
"Status": "1444",
"OrderDates": {
"OrderDate": [
{
"ActualDate": "2019-08-01",
"DateTypeId": "PROMISE_DATE",
"OrderHeaderKey": "416734325",
"OrderLineKey": "123416734326",
"OrderReleaseKey": "",
"Extn": {
"Ext": [{"a": 1, "b": 2, "c": 3}, {"a": 8, "b": 9}]
},
},
{
"ActualDate": "2020-03-22",
"CommittedDate": "2020-03-22",
"DateTypeId": "SHIPPED_OR_CANCELLED",
"OrderHeaderKey": "416734325",
"OrderLineKey": "123416734326",
"OrderReleaseKey": " ",
"RequestedDate": "2020-03-22",
},
]
},
"ShipDates": {
"ShipDate": [
{
"ActualDate": "2019-08-01",
"DateTypeId": "PROMISE_DATE",
"OrderHeaderKey": "416734325",
"OrderLineKey": "123416734326",
"Entn": {
"Ext": [
{
"p": 1,
"q": 2,
},
{
"p": 9,
},
]
},
},
{
"ActualDate": "2020-03-22",
"CommittedDate": "2020-03-22",
"DateTypeId": "SHIPPED_OR_CANCELLED",
"OrderHeaderKey": "416734325",
"OrderLineKey": "123416734326",
},
]
},
}
}
# Flatten each branch on its own so that their rows are never cross-joined.
df1 = pd.json_normalize(data["Shipment"]["OrderDates"]["OrderDate"])
df2 = pd.json_normalize(data["Shipment"]["ShipDates"]["ShipDate"])

# Prepend the shipment-level fields, which are the only link between the branches.
for i, col in enumerate(
    [
        "ActualShipmentDate",
        "EnterpriseCode",
        "EventType",
        "ShipmentNo",
        "Status",
    ]
):
    df1.insert(i, col, data["Shipment"][col])
    df2.insert(i, col, data["Shipment"][col])

# One row per element of the nested list; records without the list keep one NaN row.
df1 = df1.explode("Extn.Ext").reset_index(drop=True)
tmp = df1.pop("Extn.Ext")
# Expand the dicts into columns; the index is kept so values align with their rows.
df1 = pd.concat([df1, tmp.dropna().apply(pd.Series)], axis=1)

df2 = df2.explode("Entn.Ext").reset_index(drop=True)
tmp = df2.pop("Entn.Ext")
df2 = pd.concat([df2, tmp.dropna().apply(pd.Series)], axis=1)

print(df1)
print(df2)
Prints:
   ActualShipmentDate  EnterpriseCode         EventType  ShipmentNo  Status  ActualDate            DateTypeId  OrderHeaderKey  OrderLineKey  OrderReleaseKey  CommittedDate  RequestedDate    a    b    c
0          2020-03-22           US001  CONFIRM_SHIPMENT     1001816    1444  2019-08-01          PROMISE_DATE       416734325  123416734326                             NaN            NaN  1.0  2.0  3.0
1          2020-03-22           US001  CONFIRM_SHIPMENT     1001816    1444  2019-08-01          PROMISE_DATE       416734325  123416734326                             NaN            NaN  8.0  9.0  NaN
2          2020-03-22           US001  CONFIRM_SHIPMENT     1001816    1444  2020-03-22  SHIPPED_OR_CANCELLED       416734325  123416734326                      2020-03-22     2020-03-22  NaN  NaN  NaN

   ActualShipmentDate  EnterpriseCode         EventType  ShipmentNo  Status  ActualDate            DateTypeId  OrderHeaderKey  OrderLineKey  CommittedDate    p    q
0          2020-03-22           US001  CONFIRM_SHIPMENT     1001816    1444  2019-08-01          PROMISE_DATE       416734325  123416734326            NaN  1.0  2.0
1          2020-03-22           US001  CONFIRM_SHIPMENT     1001816    1444  2019-08-01          PROMISE_DATE       416734325  123416734326            NaN  9.0  NaN
2          2020-03-22           US001  CONFIRM_SHIPMENT     1001816    1444  2020-03-22  SHIPPED_OR_CANCELLED       416734325  123416734326     2020-03-22  NaN  NaN
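Since the question mentions 35 files with different structures, hard-coding the branch paths may not scale. Below is a rough sketch of a more generic variant that uses only the standard library, which may be convenient in a Lambda function. The helper names (`iter_rows`, `split_branches`, `write_branch_csvs`), the dotted column names, and the `<branch>/<branch>.csv` layout are my own assumptions, not something from the question. It treats every top-level list or dictionary as one branch, repeats the top-level scalar fields on every row, and emits one row per chain of nested records. Sibling lists inside a branch are stacked as separate rows rather than cross-joined.

```python
import csv
import os
import tempfile


def is_container(value):
    return isinstance(value, (dict, list))


def iter_rows(node, prefix, context):
    """Yield one flat dict per root-to-leaf chain of records under `node`."""
    if isinstance(node, list):
        for item in node:
            yield from iter_rows(item, prefix, context)
        return
    if not isinstance(node, dict):
        # Scalar inside a list: store it under the list's own dotted name.
        yield {**context, prefix: node}
        return
    row = dict(context)
    children = []
    for key, value in node.items():
        name = f"{prefix}.{key}"
        if is_container(value):
            children.append((name, value))
        else:
            row[name] = value
    emitted = False
    for name, child in children:
        for child_row in iter_rows(child, name, row):
            emitted = True
            yield child_row
    if not emitted:
        # No nested records (or only empty ones): keep the parent as one row.
        yield row


def split_branches(record):
    """Return {branch name: list of flat rows}; top-level scalars go on every row."""
    header = {k: v for k, v in record.items() if not is_container(v)}
    return {
        name: list(iter_rows(value, name, header))
        for name, value in record.items()
        if is_container(value)
    }


def write_branch_csvs(branches, out_dir):
    """Write each branch to <out_dir>/<branch>/<branch>.csv and return the paths."""
    paths = {}
    for name, rows in branches.items():
        # Union of keys across all rows, in first-seen order.
        columns = list(dict.fromkeys(col for row in rows for col in row))
        folder = os.path.join(out_dir, name)
        os.makedirs(folder, exist_ok=True)
        paths[name] = os.path.join(folder, f"{name}.csv")
        with open(paths[name], "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=columns, restval="")
            writer.writeheader()
            writer.writerows(rows)
    return paths


# Trimmed-down version of the sample from the question.
sample = {
    "Shipment": {
        "ShipmentNo": "1001816",
        "OrderDates": {
            "OrderDate": [
                {"DateTypeId": "PROMISE_DATE",
                 "Extn": {"Ext": [{"a": 1, "b": 2, "c": 3}, {"a": 8, "b": 9}]}},
                {"DateTypeId": "SHIPPED_OR_CANCELLED"},
            ]
        },
        "ShipDates": {
            "ShipDate": [
                {"DateTypeId": "PROMISE_DATE",
                 "Entn": {"Ext": [{"p": 1, "q": 2}, {"p": 9}]}},
                {"DateTypeId": "SHIPPED_OR_CANCELLED"},
            ]
        },
    }
}

branches = split_branches(sample["Shipment"])
csv_paths = write_branch_csvs(branches, tempfile.mkdtemp())
for name, rows in branches.items():
    print(name, len(rows), csv_paths[name])
```

The dotted column names (for example `OrderDates.OrderDate.Extn.Ext.a`) are long, but they avoid collisions when the same key appears at several levels; rename them afterwards if you prefer the short names. In a Lambda function or Glue job you would presumably replace the temporary directory with your own output location.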