Using Pandas to Flatten a JSON with a nested array

Question

I have the following JSON. I want to pull out task, flatten it, put it into its own data frame, and include the id from the parent.

[
{
"id": 123456,
"assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"task":[{
         "assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
         "resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
         "taskId":898989,
         "status":"Closed"
        },
        {
         "assignee":{"id":5857,"firstName":"Nacy","lastName":"Johnson"},
         "resolvedBy":{"id":5857,"firstName":"George","lastName":"Johnson"},
         "taskId":999999
         }
       ],
"state":"Complete"
},
{
"id": 123477,
"assignee":{"id":8576,"firstName":"Jack","lastName":"Johnson"},
"resolvedBy":{"id":null,"firstName":null,"lastName":null},
"task":[],
"state":"Inprogress"
}
]

I would like to get a dataframe of the tasks with these columns:

id, assignee.id, assignee.firstName, assignee.lastName, resolvedBy.firstName, resolvedBy.lastName, taskId, status

I have flattened the entire dataframe using

df=pd.json_normalize(json.loads(df.to_json(orient='records')))

That left tasks as a list of dicts ([{}]), which I think is okay because I want to pull tasks out into their own dataframe and include the id from the parent.

I have id and tasks in a dataframe like so

tasksdf=storiesdf[['tasks','id']]

Then I want to normalize it like this:

tasksdf=pd.json_normalize(json.loads(tasksdf.to_json(orient='records')))

But I know that since tasks is an array, I need to do something different, and I have not been able to figure out what. I have been looking at other examples and reading what others have done. Any help would be appreciated.

Answer 1

The main problem is that the task list is empty for some records, so those records produce no rows at all if you build the dataframe with json_normalize and record_path='task'.
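To make the point above concrete, here is a minimal sketch using a trimmed-down copy of the question's data (only id, state and task are kept; the variable names records, tasks and missing are just for illustration). It shows that the parent with an empty task list does not survive the normalization:

```python
import pandas as pd

# Trimmed-down copy of the question's data: one parent with two tasks,
# one parent whose "task" list is empty.
records = [
    {"id": 123456, "state": "Complete",
     "task": [{"taskId": 898989, "status": "Closed"}, {"taskId": 999999}]},
    {"id": 123477, "state": "Inprogress", "task": []},
]

# One row per task, with id and state copied down from the parent.
tasks = pd.json_normalize(records, record_path="task", meta=["id", "state"])

# Parent ids present in the input but absent from the normalized result.
missing = {r["id"] for r in records} - set(tasks["id"])
print(len(tasks), missing)
```

If the empty parents need to show up in the result, they have to be brought back in some other way, for example with an outer or left merge.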

Secondly, assignee and resolvedBy appear both at the parent level and inside each nested task, so some columns overlap. I would therefore create the assignee.id, resolvedBy.id, etc. columns first and merge them with the normalized task:

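import json
import pandas as pd

# json_str holds the JSON text from the question
json_data = json.loads(json_str)

# One row per task; a record with an empty task list keeps one row with NaN
df = pd.DataFrame.from_dict(json_data)
df = df.explode('task')

# Expand the parent-level assignee dict into dotted columns
df_assign = pd.DataFrame()
df_assign[["assignee.id", "assignee.firstName", "assignee.lastName"]] = pd.DataFrame(df['assignee'].values.tolist(), index=df.index)
df = df.join(df_assign).drop('assignee', axis=1)

# Same for the parent-level resolvedBy dict
df_resolv = pd.DataFrame()
df_resolv[["resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"]] = pd.DataFrame(df['resolvedBy'].values.tolist(), index=df.index)
df = df.join(df_resolv).drop('resolvedBy', axis=1)

# Normalize the nested tasks, carrying id and state over from the parent
df_task = pd.json_normalize(json_data, record_path='task', meta=['id', 'state'])
# The outer merge keeps both parents without tasks and tasks that do not match the parent columns
df = df.merge(df_task, on=['id', 'state', "assignee.id", "assignee.firstName", "assignee.lastName", "resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"], how="outer").drop('task', axis=1)

print(df.drop_duplicates().reset_index(drop=True))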

Output:

         id       state  assignee.id assignee.firstName  ... resolvedBy.firstName  resolvedBy.lastName    taskId  status
0  123456.0    Complete         5757                Jim  ...                  Jim              Johnson  898989.0  Closed
1  123477.0  Inprogress         8576               Jack  ...                 None                 None       NaN     NaN
2    123456    Complete         5857               Nacy  ...               George              Johnson  999999.0     NaN
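If only the tasks frame with the parent id is needed, as the question asks, a shorter route may be enough. The following is a sketch of an alternative, not the code above: it assumes the parent-level assignee and resolvedBy can be ignored, so the sample data is trimmed to id, task and state, and the names records, tasks, parents and with_empty are illustrative.

```python
import pandas as pd

# Question's data, trimmed to the fields this sketch uses.
records = [
    {"id": 123456, "state": "Complete", "task": [
        {"assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
         "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
         "taskId": 898989, "status": "Closed"},
        {"assignee": {"id": 5857, "firstName": "Nacy", "lastName": "Johnson"},
         "resolvedBy": {"id": 5857, "firstName": "George", "lastName": "Johnson"},
         "taskId": 999999},
    ]},
    {"id": 123477, "state": "Inprogress", "task": []},
]

# One row per task; nested dicts become dotted columns such as
# assignee.id, and meta copies the parent's id onto each task row.
tasks = pd.json_normalize(records, record_path="task", meta=["id"])

# Move the parent id to the front.
tasks = tasks[["id"] + [c for c in tasks.columns if c != "id"]]

# Optional: bring back parents that have no tasks as a single NaN row.
# The meta column comes back as object dtype, so cast it before merging.
tasks["id"] = tasks["id"].astype("int64")
parents = pd.DataFrame({"id": [r["id"] for r in records]})
with_empty = parents.merge(tasks, on="id", how="left")

print(with_empty)
```

Here tasks holds only real tasks, while with_empty also keeps a row for the parent whose task list is empty. Which of the two is wanted depends on how the frame is used afterwards.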

solution1
0 2021-11-21 09:58:05
