
How do I create separate dataframes or CSVs from a nested JSON with unrelated branches?

I have nested JSON files with multiple separate branches that can only be joined through the information at the top level, above the branches. I do not want to cross-join rows from different branches. The branches can contain lists and dictionaries, which can in turn contain further lists and dictionaries.

The following is a sample JSON file. I have 35 different files with different structures. I want to create a separate flat file for each branch, with each file stored in its own folder. Later, the data in these files will be processed and queried.

"Shipment": {
    "ActualShipmentDate": "2020-03-22",
    "EnterpriseCode": "US001",
    "EventType": "CONFIRM_SHIPMENT",
    "ShipmentNo": "1001816",
    "Status": "1444",
    "OrderDates": {
        "OrderDate": [{
                "ActualDate": "2019-08-01",
                "DateTypeId": "PROMISE_DATE",
                "OrderHeaderKey": "416734325",
                "OrderLineKey": "123416734326",
                "OrderReleaseKey": "",
                "Extn": {
                    "Ext": [{
                            "a": 1,
                            "b": 2,
                            "c": 3
                        }, {
                            "a": 8,
                            "b": 9
                        }
                    ]
                }
            }, {
                "ActualDate": "2020-03-22",
                "CommittedDate": "2020-03-22",
                "DateTypeId": "SHIPPED_OR_CANCELLED",
                "OrderHeaderKey": "416734325",
                "OrderLineKey": "123416734326",
                "OrderReleaseKey": " ",
                "RequestedDate": "2020-03-22"
            }
        ]
    },
    "ShipDates": {
        "ShipDate": [{
                "ActualDate": "2019-08-01",
                "DateTypeId": "PROMISE_DATE",
                "OrderHeaderKey": "416734325",
                "OrderLineKey": "123416734326",
                "Entn": {
                    "Ext": [{
                            "p": 1,
                            "q": 2
                        }, {
                            "p": 9
                        }
                    ]
                }
            }, {
                "ActualDate": "2020-03-22",
                "CommittedDate": "2020-03-22",
                "DateTypeId": "SHIPPED_OR_CANCELLED",
                "OrderHeaderKey": "416734325",
                "OrderLineKey": "123416734326"
            }
        ]
    }
}

The tree structure of the sample JSON file above is shown in this image.

How can I get separate structures in Python like the ones in this image?

I'm trying to do this either in an AWS Lambda function or a Glue job.

Thanks a lot in advance for your help.

You can try the following (rename the final columns as needed):

import pandas as pd

data = {
    "Shipment": {
        "ActualShipmentDate": "2020-03-22",
        "EnterpriseCode": "US001",
        "EventType": "CONFIRM_SHIPMENT",
        "ShipmentNo": "1001816",
        "Status": "1444",
        "OrderDates": {
            "OrderDate": [
                {
                    "ActualDate": "2019-08-01",
                    "DateTypeId": "PROMISE_DATE",
                    "OrderHeaderKey": "416734325",
                    "OrderLineKey": "123416734326",
                    "OrderReleaseKey": "",
                    "Extn": {
                        "Ext": [{"a": 1, "b": 2, "c": 3}, {"a": 8, "b": 9}]
                    },
                },
                {
                    "ActualDate": "2020-03-22",
                    "CommittedDate": "2020-03-22",
                    "DateTypeId": "SHIPPED_OR_CANCELLED",
                    "OrderHeaderKey": "416734325",
                    "OrderLineKey": "123416734326",
                    "OrderReleaseKey": " ",
                    "RequestedDate": "2020-03-22",
                },
            ]
        },
        "ShipDates": {
            "ShipDate": [
                {
                    "ActualDate": "2019-08-01",
                    "DateTypeId": "PROMISE_DATE",
                    "OrderHeaderKey": "416734325",
                    "OrderLineKey": "123416734326",
                    "Entn": {
                        "Ext": [
                            {
                                "p": 1,
                                "q": 2,
                            },
                            {
                                "p": 9,
                            },
                        ]
                    },
                },
                {
                    "ActualDate": "2020-03-22",
                    "CommittedDate": "2020-03-22",
                    "DateTypeId": "SHIPPED_OR_CANCELLED",
                    "OrderHeaderKey": "416734325",
                    "OrderLineKey": "123416734326",
                },
            ]
        },
    }
}

df1 = pd.json_normalize(data["Shipment"]["OrderDates"]["OrderDate"])
df2 = pd.json_normalize(data["Shipment"]["ShipDates"]["ShipDate"])

for i, col in enumerate(
    [
        "ActualShipmentDate",
        "EnterpriseCode",
        "EventType",
        "ShipmentNo",
        "Status",
    ]
):
    df1.insert(i, col, data["Shipment"][col])
    df2.insert(i, col, data["Shipment"][col])


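# Explode the nested list so that each "Ext" element gets its own row.
df1 = df1.explode("Extn.Ext").reset_index(drop=True)
tmp = df1.pop("Extn.Ext")
# Expand the dicts into columns. Keeping tmp's index lets concat align the
# rows correctly even if a row without "Ext" data is not the last one.
df1 = pd.concat([df1, tmp[tmp.notna()].apply(pd.Series)], axis=1)

# Same steps for the second branch (note the key is "Entn" in the sample).
df2 = df2.explode("Entn.Ext").reset_index(drop=True)
tmp = df2.pop("Entn.Ext")
df2 = pd.concat([df2, tmp[tmp.notna()].apply(pd.Series)], axis=1)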

print(df1)
print(df2)

Prints:

  ActualShipmentDate EnterpriseCode         EventType ShipmentNo Status  ActualDate            DateTypeId OrderHeaderKey  OrderLineKey OrderReleaseKey CommittedDate RequestedDate    a    b    c
0         2020-03-22          US001  CONFIRM_SHIPMENT    1001816   1444  2019-08-01          PROMISE_DATE      416734325  123416734326                           NaN           NaN  1.0  2.0  3.0
1         2020-03-22          US001  CONFIRM_SHIPMENT    1001816   1444  2019-08-01          PROMISE_DATE      416734325  123416734326                           NaN           NaN  8.0  9.0  NaN
2         2020-03-22          US001  CONFIRM_SHIPMENT    1001816   1444  2020-03-22  SHIPPED_OR_CANCELLED      416734325  123416734326                    2020-03-22    2020-03-22  NaN  NaN  NaN

  ActualShipmentDate EnterpriseCode         EventType ShipmentNo Status  ActualDate            DateTypeId OrderHeaderKey  OrderLineKey CommittedDate    p    q
0         2020-03-22          US001  CONFIRM_SHIPMENT    1001816   1444  2019-08-01          PROMISE_DATE      416734325  123416734326           NaN  1.0  2.0
1         2020-03-22          US001  CONFIRM_SHIPMENT    1001816   1444  2019-08-01          PROMISE_DATE      416734325  123416734326           NaN  9.0  NaN
2         2020-03-22          US001  CONFIRM_SHIPMENT    1001816   1444  2020-03-22  SHIPPED_OR_CANCELLED      416734325  123416734326    2020-03-22  NaN  NaN
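
With 35 files of differing structure, hard-coding each branch path as above may not scale. Below is a rough sketch of a more generic approach, not a tested solution for your real data: it treats every top-level scalar as shared header information, treats every top-level dict or list as its own branch, and flattens each branch separately so rows from different branches are never cross-joined. The helper names (flatten, split_branches, write_branches) and the trimmed-down sample are made up for illustration. It assumes that sibling lists inside a single branch may be cross-joined, and that dotted path prefixes are acceptable as column names.

```python
from pathlib import Path

import pandas as pd


def flatten(node, prefix=""):
    """Turn a nested dict/list into a list of flat row dicts."""
    if isinstance(node, list):
        rows = [row for item in node for row in flatten(item, prefix)]
        return rows or [{}]  # an empty list still keeps the parent row
    if isinstance(node, dict):
        rows = [{}]
        for key, value in node.items():
            name = f"{prefix}.{key}" if prefix else key
            children = flatten(value, name)
            # Pair every row built so far with every child row.
            rows = [{**row, **child} for row in rows for child in children]
        return rows
    return [{prefix: node}]  # scalar leaf becomes a single column


def split_branches(root):
    """Return one DataFrame per top-level dict/list found in root."""
    header = {k: v for k, v in root.items() if not isinstance(v, (dict, list))}
    frames = {}
    for key, value in root.items():
        if isinstance(value, (dict, list)):
            # Prepend the shared header fields to every row of this branch.
            frames[key] = pd.DataFrame(
                [{**header, **row} for row in flatten(value)]
            )
    return frames


def write_branches(frames, out_dir):
    """Write each branch to <out_dir>/<branch>/<branch>.csv (assumed layout)."""
    paths = {}
    for name, df in frames.items():
        folder = Path(out_dir) / name
        folder.mkdir(parents=True, exist_ok=True)
        paths[name] = folder / f"{name}.csv"
        df.to_csv(paths[name], index=False)
    return paths


# Trimmed-down version of the sample from the question.
sample = {
    "Shipment": {
        "ShipmentNo": "1001816",
        "Status": "1444",
        "OrderDates": {
            "OrderDate": [
                {"DateTypeId": "PROMISE_DATE",
                 "Extn": {"Ext": [{"a": 1, "b": 2, "c": 3}, {"a": 8, "b": 9}]}},
                {"DateTypeId": "SHIPPED_OR_CANCELLED"},
            ]
        },
        "ShipDates": {
            "ShipDate": [
                {"DateTypeId": "PROMISE_DATE",
                 "Entn": {"Ext": [{"p": 1, "q": 2}, {"p": 9}]}},
                {"DateTypeId": "SHIPPED_OR_CANCELLED"},
            ]
        },
    }
}

frames = split_branches(sample["Shipment"])
```

Calling write_branches(frames, "output") would then create output/OrderDates/OrderDates.csv and output/ShipDates/ShipDates.csv. The column names come out prefixed with their path (for example OrderDate.Extn.Ext.a), so they differ from the ones printed above. In a Lambda function the output directory would need to be under /tmp before uploading the files elsewhere.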
