Using Pandas to Flatten a JSON with a nested array
I have the following JSON. I want to pull out task, flatten it, and put it into its own DataFrame, including the id from the parent.
[
{
"id": 123456,
"assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"task":[{
"assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"taskId":898989,
"status":"Closed"
},
{
"assignee":{"id":5857,"firstName":"Nacy","lastName":"Johnson"},
"resolvedBy":{"id":5857,"firstName":"George","lastName":"Johnson"},
"taskId":999999
}
],
"state":"Complete"
},
{
"id": 123477,
"assignee":{"id":8576,"firstName":"Jack","lastName":"Johnson"},
"resolvedBy":{"id":null,"firstName":null,"lastName":null},
"task":[],
"state":"Inprogress"
}
]
I would like to get a DataFrame from the tasks with these columns:
id, assignee.id, assignee.firstName, assignee.lastName, resolvedBy.firstName, resolvedBy.lastName, taskId, status
I have flattened the entire DataFrame using:
df=pd.json_normalize(json.loads(df.to_json(orient='records')))
That left tasks as [{}], which I think is okay because I want to pull tasks out into its own DataFrame and include the id from the parent.
I have id and tasks in a DataFrame like so:
tasksdf=storiesdf[['tasks','id']]
Then I want to normalize it like this:
tasksdf=pd.json_normalize(json.loads(tasksdf.to_json(orient='records')))
But I know that since it is in an array I need to do something different. However, I have not been able to figure it out. I have been looking at other examples and reading what others have done. Any help would be appreciated.
The main problem is that your task list is empty in some cases, so that parent record won't appear in your DataFrame if you create it with json_normalize. Secondly, some columns are redundant between assignee, resolvedBy and the nested task. I would therefore create the assignee.id, resolvedBy.id, etc. columns first and merge them with the normalized task:
import json
import pandas as pd

# json_str is assumed to hold the JSON text from the question
json_data = json.loads(json_str)
df = pd.DataFrame.from_dict(json_data)
df = df.explode('task')  # one row per task; an empty list becomes NaN
# Expand the parent-level assignee dict into dotted columns
df_assign = pd.DataFrame()
df_assign[["assignee.id", "assignee.firstName", "assignee.lastName"]] = pd.DataFrame(df['assignee'].values.tolist(), index=df.index)
df = df.join(df_assign).drop('assignee', axis=1)
# Expand the parent-level resolvedBy dict the same way
df_resolv = pd.DataFrame()
df_resolv[["resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"]] = pd.DataFrame(df['resolvedBy'].values.tolist(), index=df.index)
df = df.join(df_resolv).drop('resolvedBy', axis=1)
# Normalize the nested tasks, carrying the parent id and state along
df_task = pd.json_normalize(json_data, record_path='task', meta=['id', 'state'])
# The outer merge keeps parents that have no tasks
df = df.merge(df_task, on=['id', 'state', "assignee.id", "assignee.firstName", "assignee.lastName", "resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"], how="outer").drop('task', axis=1)
print(df.drop_duplicates().reset_index(drop=True))
Output:
         id       state assignee.id assignee.firstName  ... resolvedBy.firstName resolvedBy.lastName    taskId  status
0  123456.0    Complete        5757                Jim  ...                  Jim             Johnson  898989.0  Closed
1  123477.0  Inprogress        8576               Jack  ...                 None                None       NaN     NaN
2    123456    Complete        5857               Nacy  ...               George             Johnson  999999.0     NaN
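If only the task-level frame from the question is needed, a shorter route may be enough. The following is a minimal sketch, not the answer's method: it assumes the JSON is available as a string (here called json_str, copied from the question) and uses json_normalize with record_path and meta to get one row per task, with the parent id attached. The names wanted, parents and with_parents are made up for illustration. The optional left merge at the end is one way to keep parents whose task list is empty.

```python
import json
import pandas as pd

# Copy of the question's JSON (assumed to be available as a string)
json_str = """
[
 {"id": 123456,
  "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
  "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
  "task": [
   {"assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
    "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
    "taskId": 898989, "status": "Closed"},
   {"assignee": {"id": 5857, "firstName": "Nacy", "lastName": "Johnson"},
    "resolvedBy": {"id": 5857, "firstName": "George", "lastName": "Johnson"},
    "taskId": 999999}
  ],
  "state": "Complete"},
 {"id": 123477,
  "assignee": {"id": 8576, "firstName": "Jack", "lastName": "Johnson"},
  "resolvedBy": {"id": null, "firstName": null, "lastName": null},
  "task": [],
  "state": "Inprogress"}
]
"""
records = json.loads(json_str)

# Columns requested in the question, in that order
wanted = ["id", "assignee.id", "assignee.firstName", "assignee.lastName",
          "resolvedBy.firstName", "resolvedBy.lastName", "taskId", "status"]

# One row per element of each "task" list; nested dicts become dotted
# columns and the parent's "id" is repeated on every task row.
tasks = pd.json_normalize(records, record_path="task", meta=["id"])

# Meta columns come back as object dtype, so cast id before reordering.
tasks = tasks.astype({"id": "int64"})[wanted]

# Optional: a left merge from the parent ids keeps parents with no tasks.
parents = pd.DataFrame({"id": [r["id"] for r in records]})
with_parents = parents.merge(tasks, on="id", how="left")

print(tasks)
print(with_parents)
```

Here tasks holds only the two real tasks, while with_parents adds a third row for id 123477 whose task columns are all missing. Which of the two is appropriate depends on whether parents without tasks should show up in the result.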