Using Pandas to Flatten a JSON with a nested array

I have the following JSON. I want to pull out task, flatten it, and put it into its own data frame, including the id from the parent.

[
{
"id": 123456,
"assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
"task":[{
         "assignee":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
         "resolvedBy":{"id":5757,"firstName":"Jim","lastName":"Johnson"},
         "taskId":898989,
         "status":"Closed"
        },
        {
         "assignee":{"id":5857,"firstName":"Nacy","lastName":"Johnson"},
         "resolvedBy":{"id":5857,"firstName":"George","lastName":"Johnson"},
         "taskId":999999
         }
       ],
"state":"Complete"
},
{
"id": 123477,
"assignee":{"id":8576,"firstName":"Jack","lastName":"Johnson"},
"resolvedBy":{"id":null,"firstName":null,"lastName":null},
"task":[],
"state":"Inprogress"
}
]  

I would like to get a dataframe from the tasks, like so:

id, assignee.id, assignee.firstName, assignee.lastName, resolvedBy.firstName, resolvedBy.lastName, taskId, status

I have flattened the entire dataframe using:

df=pd.json_normalize(json.loads(df.to_json(orient='records')))

It left tasks as [{}], which I think is okay because I want to pull tasks out into their own dataframe and include the id from the parent.
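For illustration, here is a minimal sketch of that behaviour. The records list is a hypothetical cut-down version of the JSON above, not the real data: json_normalize turns nested dicts into dotted columns but leaves a list of dicts untouched in a single cell.

```python
import pandas as pd

# Hypothetical cut-down record modelled on the JSON above
records = [
    {
        "id": 123456,
        "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
        "task": [{"taskId": 898989, "status": "Closed"}, {"taskId": 999999}],
    }
]

flat = pd.json_normalize(records)

# Nested dicts become dotted columns such as "assignee.firstName";
# the list of task dicts stays as one Python list in the "task" cell.
columns = list(flat.columns)
task_cell = flat.loc[0, "task"]
print(columns)
print(type(task_cell).__name__, len(task_cell))
```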

I have id and tasks in a dataframe like so:

tasksdf=storiesdf[['tasks','id']]

Then I want to normalize it like this:

tasksdf=pd.json_normalize(json.loads(tasksdf.to_json(orient='records')))

But I know that since it is in an array I need to do something different. However, I have not been able to figure it out. I have been looking at other examples and reading what others have done. Any help would be appreciated.

The main problem is that your task record is empty in some cases, so those parent rows won't appear in your dataframe if you create it with json_normalize.
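A small sketch of that effect, using a hypothetical trimmed copy of the data (only id and task are kept): with record_path='task', a parent whose task list is empty contributes no rows at all.

```python
import pandas as pd

# Hypothetical trimmed copy of the question's data
records = [
    {"id": 123456, "task": [{"taskId": 898989, "status": "Closed"}, {"taskId": 999999}]},
    {"id": 123477, "task": []},
]

# One row per task; the parent id is attached through meta
tasks = pd.json_normalize(records, record_path="task", meta=["id"])

# Parents that produced no rows because their task list was empty
ids_with_tasks = set(tasks["id"])
missing_parents = [r["id"] for r in records if r["id"] not in ids_with_tasks]
print(tasks)
print(missing_parents)
```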

Secondly, some columns are redundant between assignee, resolvedBy and the nested task. I would therefore create the assignee.id, resolvedBy.id, etc. columns first and merge them with the normalized task:

import json
import pandas as pd

# json_str is assumed to hold the JSON text shown in the question
json_data = json.loads(json_str)

df = pd.DataFrame.from_dict(json_data)
df = df.explode('task')

df_assign = pd.DataFrame()
df_assign[["assignee.id", "assignee.firstName", "assignee.lastName"]] = pd.DataFrame(df['assignee'].values.tolist(), index=df.index)
df = df.join(df_assign).drop('assignee', axis=1)

df_resolv = pd.DataFrame()
df_resolv[["resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"]] = pd.DataFrame(df['resolvedBy'].values.tolist(), index=df.index)
df = df.join(df_resolv).drop('resolvedBy', axis=1)

df_task = pd.json_normalize(json_data, record_path='task', meta=['id', 'state'])
df = df.merge(df_task, on=['id', 'state', "assignee.id", "assignee.firstName", "assignee.lastName", "resolvedBy.id", "resolvedBy.firstName", "resolvedBy.lastName"], how="outer").drop('task', axis=1)

print(df.drop_duplicates().reset_index(drop=True))

Output:

         id       state  assignee.id assignee.firstName  ... resolvedBy.firstName  resolvedBy.lastName    taskId  status
0  123456.0    Complete         5757                Jim  ...                  Jim              Johnson  898989.0  Closed
1  123477.0  Inprogress         8576               Jack  ...                 None                 None       NaN     NaN
2    123456    Complete         5857               Nacy  ...               George              Johnson  999999.0     NaN
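If only the task-level assignee and resolvedBy are wanted, as in the column list in the question, a shorter route may be enough. This is a sketch under that assumption, not the method used above: normalize task with the parent id as meta, then left-merge onto the list of parent ids so that parents without tasks are kept. The names parents and result are only illustrative.

```python
import pandas as pd

json_data = [
    {
        "id": 123456,
        "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
        "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
        "task": [
            {
                "assignee": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
                "resolvedBy": {"id": 5757, "firstName": "Jim", "lastName": "Johnson"},
                "taskId": 898989,
                "status": "Closed",
            },
            {
                "assignee": {"id": 5857, "firstName": "Nacy", "lastName": "Johnson"},
                "resolvedBy": {"id": 5857, "firstName": "George", "lastName": "Johnson"},
                "taskId": 999999,
            },
        ],
        "state": "Complete",
    },
    {
        "id": 123477,
        "assignee": {"id": 8576, "firstName": "Jack", "lastName": "Johnson"},
        "resolvedBy": {"id": None, "firstName": None, "lastName": None},
        "task": [],
        "state": "Inprogress",
    },
]

# One row per task; nested assignee/resolvedBy dicts inside each task are
# flattened to dotted columns and the parent id is attached via meta.
tasks = pd.json_normalize(json_data, record_path="task", meta=["id"])

# meta columns come back as object dtype, so cast before merging on id
tasks["id"] = tasks["id"].astype("int64")

# Left-merge onto the parent ids so parents with an empty task list survive
parents = pd.DataFrame({"id": [rec["id"] for rec in json_data]})
result = parents.merge(tasks, on="id", how="left")
print(result)
```

The parent with an empty task list ends up as a single row whose task columns are all missing.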
