
Parse a nested JSON array in Python, keeping the mapping to the JSON object

I've got a large JSON file with the following structure:

{
    "Project": {
        "AAA": {
            "Version": [
                {
                    "id": "00001",
                    "name": "08.12.2019",
                    "description": null,
                    "released": true,
                    "releaseDate": "2019-08-12"
                },
                {
                    "id": "00002",
                    "name": "2019.8.26",
                    "description": null,
                    "released": true,
                    "releaseDate": "2019-08-26"
                }
            ]
        },
        "BBB": {
            "Version": [
                {
                    "id": "00003",
                    "name": "AABBY3",
                    "description": "2019 Accounting Year End",
                    "released": false,
                    "releaseDate": null
                },
                {
                    "id": "00004",
                    "name": "AACCZ4",
                    "description": "Financial Statements 2019",
                    "released": false,
                    "releaseDate": null
                },
                {
                    "id": "00005",
                    "name": "AADDZ5",
                    "description": null,
                    "released": false,
                    "releaseDate": null
                }
            ]
        }
    }
}

I'm having a problem converting this into a pandas DataFrame because of the nested array. How can I extract all the data in each Version for each Project, while maintaining a reference to the Project?

I've so far only managed to get a dataframe of the following structure:

df.head(3)
Out[10]: 
      description     id    name releaseDate  released
0  Version 5.4.1.  10703  V5R4M1  2010-09-15      True
1   Version 5.5.1  10704  V5R5M1  2015-04-20      True
2   Version 6.1.1  10705  V6R1M1  2016-10-14      True

using the following:

import json
import pandas as pd

with open("fixVer2.json", "r") as read_file:
    data = json.load(read_file)

prj_list = ['AAA', 'BBB', 'CCC', 'DDD']

# Collect every version dict from every project into one flat list
d_list = []
for x in prj_list:
    d = data['Project'][x]['Version']
    for el in d:
        d_list.append(el)

df = pd.DataFrame(d_list)

However, some names are duplicated across projects with different releaseDates, so I need to keep the Project name to identify the correct releaseDate for each name.

desired output:

      description     id    name releaseDate  released  Project
0  Version 5.4.1.  10703  V5R4M1  2010-09-15      True  CCC
1   Version 5.5.1  10704  V5R5M1  2015-04-20      True  CCC
2   Version 6.1.1  10705  V6R1M1  2016-10-14      True  CCC

I am unsure how I can parse the nested array, keep the Project name detail and consolidate all of it into one dataframe/other Python structure

In your solution, you can set the project name on each version dict before appending it:

d_list = []
for x in prj_list:
    d = data['Project'][x]['Version']
    for el in d:
        el['Project'] = x
        d_list.append(el)

Or use list comprehension:

prj_list = ['AAA', 'BBB']

# Copy each version dict and add the project name under a 'Project' key
d_list = [{**el, 'Project': x} for x in prj_list for el in data['Project'][x]['Version']]
df = pd.DataFrame(d_list)
print(df)
      id        name                description  released releaseDate Project
0  00001  08.12.2019                       None      True  2019-08-12     AAA
1  00002   2019.8.26                       None      True  2019-08-26     AAA
2  00003      AABBY3   2019 Accounting Year End     False        None     BBB
3  00004      AACCZ4  Financial Statements 2019     False        None     BBB
4  00005      AADDZ5                       None     False        None     BBB
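To illustrate why the extra column matters, here is a minimal sketch using only the standard library. The inline sample data is hypothetical (it is not the asker's file): the version name "V1" appears in two projects with different release dates. A comprehension like the one above tags each row with a "Project" key, and the rows are then keyed by the (project, name) pair so that duplicate names stay distinct.

```python
import json

# Hypothetical sample: the same version name "V1" exists in two projects.
raw = """
{"Project": {
    "AAA": {"Version": [{"id": "1", "name": "V1", "releaseDate": "2019-08-12"}]},
    "BBB": {"Version": [{"id": "2", "name": "V1", "releaseDate": "2020-01-31"}]}
}}
"""
data = json.loads(raw)
prj_list = ["AAA", "BBB"]

# Copy each version dict and tag the copy with its project name.
d_list = [{**el, "Project": x} for x in prj_list for el in data["Project"][x]["Version"]]

# Key each row by (project, version name) so duplicate names do not collide.
release_by_key = {(row["Project"], row["name"]): row["releaseDate"] for row in d_list}

print(release_by_key[("AAA", "V1")])
print(release_by_key[("BBB", "V1")])
```

Because the comprehension builds new dicts, the loaded `data` is left unchanged. Passing `d_list` to `pd.DataFrame` should give the same rows as a frame.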

Try this:

import json
import pandas as pd

with open("test.json", "r") as read_file:
    data = json.load(read_file)['Project']

d_list = []
# Iterate over every project in the file, so no hard-coded project list is needed
for name, dat in data.items():
    for d in dat['Version']:
        d['Project'] = name  # tag each version with its project
        d_list.append(d)

df = pd.DataFrame(d_list)
print(df)

  Project                description     id        name releaseDate  released
0     AAA                       None  00001  08.12.2019  2019-08-12      True
1     AAA                       None  00002   2019.8.26  2019-08-26      True
2     BBB   2019 Accounting Year End  00003      AABBY3        None     False
3     BBB  Financial Statements 2019  00004      AACCZ4        None     False
4     BBB                       None  00005      AADDZ5        None     False

With this approach, you don't need to keep a separate list of projects. Hope this helps!
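As a possible alternative that also avoids a separate project list, the sketch below builds one small frame per project and lets `pd.concat` attach the project name. The inline `data` dict is a hypothetical, shortened stand-in for the parsed JSON, and the level name "row" is just an illustrative label.

```python
import pandas as pd

# Hypothetical inline sample with the same shape as the JSON in the question.
data = {
    "Project": {
        "AAA": {"Version": [
            {"id": "00001", "name": "08.12.2019", "releaseDate": "2019-08-12"},
        ]},
        "BBB": {"Version": [
            {"id": "00003", "name": "AABBY3", "releaseDate": None},
            {"id": "00004", "name": "AACCZ4", "releaseDate": None},
        ]},
    }
}

# One small frame per project, keyed by project name.
frames = {name: pd.DataFrame(proj["Version"]) for name, proj in data["Project"].items()}

# Concatenating a dict uses its keys as the outer index level; name that level "Project".
df = pd.concat(frames, names=["Project", "row"])

# Move the "Project" level into a column and drop the per-project row counter.
df = df.reset_index(level="Project").reset_index(drop=True)
print(df)
```

Unlike the loop above, this version does not modify the dicts inside `data`, which may matter if the parsed JSON is reused elsewhere.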
