Flatten JSON with nested arrays using Pandas

Question

I have the following JSON. I want to pull out task, flatten it, and put it into its own dataframe that includes the id from the parent.

[
{
"id": 123456,
"assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"task":[{
         "assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
         "resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
         "taskId":898989,
         "status":"Closed"
        },
        {
         "assignee":{"id":5857,"firstName":"Nacy","lastName":"Johnson"},
         "resolvedBy":{"id":5857,"firstName":"George","lastName":"Johnson"},
         "taskId":999999
         }
       ],
"state":"Complete"
},
{
"id": 123477,
"assignee":{"id":8576,"firstName":"Jack","lastName":"Johnson"},
"resolvedBy":{"id":null,"firstName":null,"lastName":null},
"task":[],
"state":"Inprogress"
}
]

I would like to get a dataframe from the tasks like so:

id, assignee.id, assignee.firstName, assignee.lastName, resolvedBy.firstName, resolvedBy.lastName, taskId, status

I have flattened the entire dataframe using:

df=pd.json_normalize(json.loads(df.to_json(orient='records')))

It left tasks as [{}], which I think is okay because I want to pull tasks out into its own dataframe and include the id from the parent.

I have id and tasks in a dataframe like so:

tasksdf=storiesdf[['tasks','id']]

Then I want to normalize it like this:

tasksdf=pd.json_normalize(json.loads(tasksdf.to_json(orient='records')))

But I know that since it is in an array I need to do something different. However, I have not been able to figure it out. I have been looking at other examples and reading what others have done. Any help would be appreciated.

Answer 1

The main problem is that the task list is empty for some records, so those records won't appear in your dataframe if you build it from the task entries with json_normalize.
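To make that first point concrete, here is a minimal sketch using a trimmed-down copy of the question's data (only id, taskId and status are kept, which is a simplification for illustration). It shows that json_normalize with record_path emits one row per task, so a parent whose task list is empty contributes no rows.

```python
import pandas as pd

# Trimmed-down version of the question's data (illustrative subset of fields).
records = [
    {"id": 123456, "task": [{"taskId": 898989, "status": "Closed"},
                            {"taskId": 999999}]},
    {"id": 123477, "task": []},
]

# One row per element of each "task" list; the parent "id" is carried along as meta.
tasks = pd.json_normalize(records, record_path="task", meta=["id"])
print(tasks)

# The parent with the empty "task" list has no row in the result.
missing_parent = 123477 not in tasks["id"].tolist()
print(missing_parent)
```

Any approach that has to keep such parents therefore needs a second step, such as a merge back against the full list of parent ids.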

Secondly, some columns are redundant between assignee, resolvedBy and the nested task. I would therefore create the assignee.id, resolvedBy.id, etc. columns first and merge them with the normalized task:

import json
import pandas as pd

# json_str is assumed to hold the JSON text from the question
json_data = json.loads(json_str)

df = pd.DataFrame.from_dict(json_data)
df = df.explode('task')

df_assign = pd.DataFrame()
df_assign[["assignee.id", "assignee.firstName", "assignee.lastName"]] = pd.DataFrame(df['assignee'].values.tolist(), index=df.index)
df = df.join(df_assign).drop('assignee', axis=1)

df_resolv = pd.DataFrame()
df_resolv[["resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"]] = pd.DataFrame(df['resolvedBy'].values.tolist(), index=df.index)
df = df.join(df_resolv).drop('resolvedBy', axis=1)

df_task = pd.json_normalize(json_data, record_path='task', meta=['id', 'state'])
df = df.merge(df_task, on=['id', 'state', "assignee.id", "assignee.firstName", "assignee.lastName", "resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"], how="outer").drop('task', axis=1)

print(df.drop_duplicates().reset_index(drop=True))

Output:

         id       state  assignee.id assignee.firstName  ... resolvedBy.firstName  resolvedBy.lastName    taskId  status
0  123456.0    Complete         5757                Jim  ...                  Jim              Johnson  898989.0  Closed
1  123477.0  Inprogress         8576               Jack  ...                 None                 None       NaN     NaN
2    123456    Complete         5857               Nacy  ...               George              Johnson  999999.0     NaN
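If only the task-level fields plus the parent id are needed, as in the column list the question asks for, a shorter route may be enough. The following is an alternative sketch, not the method used above: it normalizes the task entries with the parent id as meta, then left-merges onto the list of parent ids so that parents with an empty task list are kept. Unlike the output above, the row for such a parent has NaN in every task-level column instead of the parent's own assignee. The names records, parents and result are illustrative.

```python
import pandas as pd

# The data from the question, as Python objects.
records = [
    {"id": 123456,
     "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
     "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
     "task": [
         {"assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
          "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
          "taskId": 898989, "status": "Closed"},
         {"assignee": {"id": 5857, "firstName": "Nacy", "lastName": "Johnson"},
          "resolvedBy": {"id": 5857, "firstName": "George", "lastName": "Johnson"},
          "taskId": 999999},
     ],
     "state": "Complete"},
    {"id": 123477,
     "assignee": {"id": 8576, "firstName": "Jack", "lastName": "Johnson"},
     "resolvedBy": {"id": None, "firstName": None, "lastName": None},
     "task": [],
     "state": "Inprogress"},
]

# Flatten every task; nested dicts become dotted columns, parent id is meta.
tasks = pd.json_normalize(records, record_path="task", meta=["id"])
tasks["id"] = tasks["id"].astype("int64")  # meta columns come back as object dtype

# Left-merge onto all parent ids so parents without tasks still get a row.
parents = pd.DataFrame({"id": [r["id"] for r in records]})
result = parents.merge(tasks, on="id", how="left")

# Keep the columns listed in the question, in that order.
result = result[["id", "assignee.id", "assignee.firstName", "assignee.lastName",
                 "resolvedBy.firstName", "resolvedBy.lastName", "taskId", "status"]]
print(result)
```

Because of the missing values, taskId and assignee.id end up as floats here; converting them to a nullable integer dtype is left out to keep the sketch short.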

Solution 1
0 2021-11-21 09:58:05