
Using Pandas to Flatten a JSON with a nested array

I have the following JSON. I want to pull out task, flatten it, put it into its own dataframe, and include the id from the parent.

[
{
"id": 123456,
"assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"task":[{
         "assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
         "resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
         "taskId":898989,
         "status":"Closed"
        },
        {
         "assignee":{"id":5857,"firstName":"Nacy","lastName":"Johnson"},
         "resolvedBy":{"id":5857,"firstName":"George","lastName":"Johnson"},
         "taskId":999999
         }
       ],
"state":"Complete"
},
{
"id": 123477,
"assignee":{"id":8576,"firstName":"Jack","lastName":"Johnson"},
"resolvedBy":{"id":null,"firstName":null,"lastName":null},
"task":[],
"state":"Inprogress"
}
]  

I would like to get a dataframe from tasks like so

id, assignee.id, assignee.firstName, assignee.lastName, resolvedBy.firstName, resolvedBy.lastName, taskId, status

I have flattened the entire dataframe using

df=pd.json_normalize(json.loads(df.to_json(orient='records')))

It left tasks as [{}], which I think is okay because I want to pull tasks out into its own dataframe and include the id from the parent.

I have id and tasks in a dataframe like so

tasksdf=storiesdf[['tasks','id']]

Then I want to normalize it like

tasksdf=pd.json_normalize(json.loads(tasksdf.to_json(orient='records')))

but I know that, since it is in an array, I need to do something different. However, I have not been able to figure it out. I have been looking at other examples and reading what others have done. Any help would be appreciated.

The main problem is that the task list is empty for some records, so those records do not appear in the dataframe if you build it with json_normalize on the task records.
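A minimal sketch of that behavior, assuming the JSON from the question has already been parsed into a Python list (the names data, tasks_df and cols are only illustrative). It uses record_path and meta to flatten each task and carry the parent id along; the parent with an empty task list contributes no rows:

```python
import pandas as pd

# The question's JSON as Python objects (null becomes None)
data = [
    {"id": 123456,
     "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
     "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
     "task": [
         {"assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
          "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
          "taskId": 898989, "status": "Closed"},
         {"assignee": {"id": 5857, "firstName": "Nacy", "lastName": "Johnson"},
          "resolvedBy": {"id": 5857, "firstName": "George", "lastName": "Johnson"},
          "taskId": 999999},
     ],
     "state": "Complete"},
    {"id": 123477,
     "assignee": {"id": 8576, "firstName": "Jack", "lastName": "Johnson"},
     "resolvedBy": {"id": None, "firstName": None, "lastName": None},
     "task": [],
     "state": "Inprogress"},
]

# One row per element of "task"; the parent's "id" is repeated on each row
tasks_df = pd.json_normalize(data, record_path="task", meta=["id"])

# Reorder to the column layout asked for in the question
cols = ["id", "assignee.id", "assignee.firstName", "assignee.lastName",
        "resolvedBy.firstName", "resolvedBy.lastName", "taskId", "status"]
tasks_df = tasks_df[cols]

print(tasks_df)
```

Only the two tasks of parent 123456 show up here. If parents without tasks should be kept as well, something extra is needed, which is what the rest of this answer deals with.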

Secondly, some columns are redundant between assignee, resolvedBy and the nested task. I would therefore create the assignee.id, resolvedBy.id, etc. columns first and merge them with the normalized task:

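import json
import pandas as pd

# json_str is assumed to hold the JSON text shown in the question
json_data = json.loads(json_str)

# One row per task; a parent with an empty task list keeps one row with NaN
df = pd.DataFrame.from_dict(json_data)
df = df.explode('task')

# Expand the parent-level assignee dicts into their own columns
df_assign = pd.DataFrame()
df_assign[["assignee.id", "assignee.firstName", "assignee.lastName"]] = pd.DataFrame(df['assignee'].values.tolist(), index=df.index)
df = df.join(df_assign).drop('assignee', axis=1)

# Same for the parent-level resolvedBy dicts
df_resolv = pd.DataFrame()
df_resolv[["resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"]] = pd.DataFrame(df['resolvedBy'].values.tolist(), index=df.index)
df = df.join(df_resolv).drop('resolvedBy', axis=1)

# Flatten the tasks and carry id and state over from the parent
df_task = pd.json_normalize(json_data, record_path='task', meta=['id', 'state'])

# Outer merge so that parents without any task are kept
df = df.merge(df_task, on=['id', 'state', "assignee.id", "assignee.firstName", "assignee.lastName", "resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"], how="outer").drop('task', axis=1)

# Remove repeated rows and renumber the index
print(df.drop_duplicates().reset_index(drop=True))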

Output:

         id       state  assignee.id assignee.firstName  ... resolvedBy.firstName  resolvedBy.lastName    taskId  status
0  123456.0    Complete         5757                Jim  ...                  Jim              Johnson  898989.0  Closed
1  123477.0  Inprogress         8576               Jack  ...                 None                 None       NaN     NaN
2    123456    Complete         5857               Nacy  ...               George              Johnson  999999.0     NaN


 