I have the following JSON. I want to pull out the task array, flatten it, put it into its own dataframe, and include the id from the parent record.
[
{
"id": 123456,
"assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"task":[{
"assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"taskId":898989,
"status":"Closed"
},
{
"assignee":{"id":5857,"firstName":"Nacy","lastName":"Johnson"},
"resolvedBy":{"id":5857,"firstName":"George","lastName":"Johnson"},
"taskId":999999
}
],
"state":"Complete"
},
{
"id": 123477,
"assignee":{"id":8576,"firstName":"Jack","lastName":"Johnson"},
"resolvedBy":{"id":null,"firstName":null,"lastName":null},
"task":[],
"state":"Inprogress"
}
]
I would like to get a dataframe of the tasks with these columns:
id, assignee.id, assignee.firstName, assignee.lastName, resolvedBy.firstName, resolvedBy.lastName, taskId, status
I have flattened the entire dataframe using
df=pd.json_normalize(json.loads(df.to_json(orient='records')))
That left tasks as a list of dicts ([{}]), which I think is okay because I want to pull tasks out into its own dataframe and include the id from the parent.
I have id and tasks in a dataframe like so:
tasksdf=storiesdf[['tasks','id']]
Then I want to normalize it like this:
tasksdf=pd.json_normalize(json.loads(tasksdf.to_json(orient='records')))
But I know that, since tasks is an array, I need to do something different, and I have not been able to figure out what. I have been looking at other examples and reading what others have done. Any help would be appreciated.
The main problem is that the task list is empty for some records, so those records won't appear in your dataframe if you create it with json_normalize.
Secondly, the assignee and resolvedBy fields appear both at the top level and inside each nested task, so their columns overlap. I would therefore create the assignee.id, resolvedBy.id, etc. columns first and then merge them with the normalized task records:
import json
import pandas as pd
# json_str is assumed to hold the JSON text from the question
json_data = json.loads(json_str)
df = pd.DataFrame.from_dict(json_data)
# one row per task; records with an empty task list keep a single row with NaN
df = df.explode('task')
# expand the top-level assignee dict into dotted columns
df_assign = pd.DataFrame()
df_assign[["assignee.id", "assignee.firstName", "assignee.lastName"]] = pd.DataFrame(df['assignee'].values.tolist(), index=df.index)
df = df.join(df_assign).drop('assignee', axis=1)
# expand the top-level resolvedBy dict the same way
df_resolv = pd.DataFrame()
df_resolv[["resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"]] = pd.DataFrame(df['resolvedBy'].values.tolist(), index=df.index)
df = df.join(df_resolv).drop('resolvedBy', axis=1)
# flatten the nested task records and carry the parent id and state along
df_task = pd.json_normalize(json_data, record_path='task', meta=['id', 'state'])
# the outer merge keeps parent rows that have no matching task
df = df.merge(df_task, on=['id', 'state', "assignee.id", "assignee.firstName", "assignee.lastName", "resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"], how="outer").drop('task', axis=1)
print(df.drop_duplicates().reset_index(drop=True))
Output:
id state assignee.id assignee.firstName ... resolvedBy.firstName resolvedBy.lastName taskId status
0 123456.0 Complete 5757 Jim ... Jim Johnson 898989.0 Closed
1 123477.0 Inprogress 8576 Jack ... None None NaN NaN
2 123456 Complete 5857 Nacy ... George Johnson 999999.0 NaN
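If you only need the task rows with the parent id attached, as in the column list from the question, a shorter route may be enough. The following is a minimal sketch, assuming the JSON has already been parsed into a Python list; the names data, tasks_df, parents and all_df are illustrative and not taken from the posts above. It uses the record_path and meta arguments of json_normalize, and then optionally left-merges onto the parent ids so that records with an empty task list are kept.

```python
import pandas as pd

# Sample data copied from the question, already parsed into Python objects.
data = [
    {
        "id": 123456,
        "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
        "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
        "task": [
            {
                "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
                "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
                "taskId": 898989,
                "status": "Closed",
            },
            {
                "assignee": {"id": 5857, "firstName": "Nacy", "lastName": "Johnson"},
                "resolvedBy": {"id": 5857, "firstName": "George", "lastName": "Johnson"},
                "taskId": 999999,
            },
        ],
        "state": "Complete",
    },
    {
        "id": 123477,
        "assignee": {"id": 8576, "firstName": "Jack", "lastName": "Johnson"},
        "resolvedBy": {"id": None, "firstName": None, "lastName": None},
        "task": [],
        "state": "Inprogress",
    },
]

# One row per task: nested dicts become dotted columns and the parent id
# is attached to every task row through meta.
tasks_df = pd.json_normalize(data, record_path="task", meta=["id"])

# Select the columns listed in the question; reindex creates any column
# that happens to be missing instead of raising a KeyError.
wanted = [
    "id", "assignee.id", "assignee.firstName", "assignee.lastName",
    "resolvedBy.firstName", "resolvedBy.lastName", "taskId", "status",
]
tasks_df = tasks_df.reindex(columns=wanted)

# Optional: keep parents that have no tasks. The meta column comes back
# with object dtype, so cast it before merging on it.
tasks_df["id"] = tasks_df["id"].astype("int64")
parents = pd.DataFrame({"id": [rec["id"] for rec in data]})
all_df = parents.merge(tasks_df, on="id", how="left")

print(tasks_df)
print(all_df)
```

Here tasks_df holds only the two task rows, while all_df also carries a row for id 123477 with the task columns left empty. Unlike the merge above, this keeps each task's own assignee and resolvedBy values and does not mix them with the parent-level ones.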