
Generic code to flatten any complicated nested json file using pyspark/pandas

I have a complicated nested JSON file. I need generic code that flattens this nested file and stores the result in a DataFrame, using either PySpark or pandas. Is this achievable, and is there any generic code that works for any complicated nested JSON file?

In the example below, the JSON is stored in the variable data. To load it from a JSON file instead, you can use

df = pd.read_json('data.json')
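Note that pd.read_json on its own leaves nested objects and lists sitting inside cells. One way around that, sketched here under the assumption that the file holds a list of records, is to parse the file with the standard json module and hand the result to json_normalize. The file contents and path below are made up so that the snippet runs on its own.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample file, written here only so the sketch is self-contained.
records = [{"company": "Google", "management": {"CEO": "ABC"}}]
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f)

# json.load returns plain Python lists and dicts, which json_normalize accepts.
with open(path, encoding="utf-8") as f:
    data = json.load(f)

# Nested dicts become dotted column names such as "management.CEO".
df = pd.json_normalize(data)
print(df.columns.tolist())
```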

I have used json_normalize() to flatten the nested JSON data.

A deeply nested JSON structure can be converted to a DataFrame by passing the record path and meta arguments to the json_normalize function, as shown below.

import pandas as pd
data = [
    {
        "company": "Google",
        "tagline": "Hello World",
        "management": {"CEO": "ABC"},
        "department": [
            {"name": "Gmail", "revenue (bn)": 123},
            {"name": "GCP", "revenue (bn)": 400},
            {"name": "Google drive", "revenue (bn)": 600},
        ],
    },
    {
        "company": "Microsoft",
        "tagline": "This is text",
        "management": {"CEO": "XYZ"},
        "department": [
            {"name": "Onedrive", "revenue (bn)": 13},
            {"name": "Azure", "revenue (bn)": 300},
            {"name": "Microsoft 365", "revenue (bn)": 300},
        ],
    },
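]

# record_path: the list to expand into rows (one row per department).
# meta: parent fields repeated on each row; a nested key is given as a path list.
df = pd.json_normalize(
    data,
    record_path="department",
    meta=["company", "tagline", ["management", "CEO"]],
)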

df

Output

[Image: the resulting DataFrame, one row per department]
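The call above needs the record path and meta fields spelled out for each file. For something closer to the generic code the question asks about, one possible approach is to keep normalizing and exploding until no list columns remain. The following is only a rough sketch: flatten_json is a hypothetical helper name, not a pandas function, and the sample data is a shortened version of the data above.

```python
import pandas as pd


def flatten_json(data, sep="."):
    """Hypothetical helper: nested dicts become columns, nested lists become rows."""
    df = pd.json_normalize(data, sep=sep)
    while True:
        # Columns that still hold at least one list.
        list_cols = [
            c for c in df.columns
            if df[c].apply(lambda v: isinstance(v, list)).any()
        ]
        if not list_cols:
            return df
        # One row per list element; the other columns are repeated.
        df = df.explode(list_cols[0], ignore_index=True)
        # Re-normalize so dict elements turn into "column.key" columns.
        df = pd.json_normalize(df.to_dict(orient="records"), sep=sep)


sample = [
    {
        "company": "Google",
        "management": {"CEO": "ABC"},
        "department": [
            {"name": "Gmail", "revenue (bn)": 123},
            {"name": "GCP", "revenue (bn)": 400},
        ],
    },
    {
        "company": "Microsoft",
        "management": {"CEO": "XYZ"},
        "department": [{"name": "Azure", "revenue (bn)": 300}],
    },
]

flat = flatten_json(sample)
print(flat)
```

Be aware that if a record has several sibling lists, exploding them one after another produces every combination of their elements, so the row count can grow quickly. It also round-trips through to_dict on every pass, which may be slow on large inputs.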

Refer to this article by jssuriyakumar.

You can also refer to this similar issue by calestini.

