How to create a pandas DataFrame from nested JSON using dictionaries

Question

I'm trying to create a pandas DataFrame from a JSON file. I've seen multiple solutions to this problem that use the built-in functions from_dict/json_normalize, yet I'm unable to apply them to my code. Here's how my data is structured in the JSON file:

     "data": [
   {
      "groups": {
         "data": [
               {
               "group": "Math",
               "year_joined": "2009"
               },
               {
               "group_name": "History",
               "year_joined": "2011"
               },
               {
               "group_name": "Biology",
               "year_joined": "2010"
               }
         ]
      },
      "id": "12512"
   },

When I try to normalize this data with a pandas function like this:

import json
import pandas as pd

path = 'mypath'
# Use a context manager so the file is closed after loading
with open(path) as f:
    data = json.load(f)

test = pd.json_normalize(
            data['data'], 
            errors='ignore')

I just receive something like this:

    id      groups.data
0   12512   [{'group_name': 'Math', 'year_joined': '2009', 'gr...
1   23172   [{'group_name': 'Chemistry', 'year_joined': '2005'...

I want this data to look like this (solution 1):

    id      group     year_joined
0   12512   group1    year1
1   12512   group2    year2
2   12512   group3    year3

Or like this (solution 2):

    id      group                   year_joined
0   12512   group1,group2,group3    year1,year2,year3
1   23172   group4,group5           year4,year5

How can I achieve this? I tried passing the 'record_path' parameter to the 'json_normalize' function, but it doesn't change anything. I tried to use the 'DataFrame.from_dict' function to work around this, but I failed. The only way I was able to get to solution 1 was to write multiple loops that iterate through everything in the JSON file and append it to separate lists. It kind of works, but it takes a lot of time on bigger datasets.

How could I use built-in pandas tools to process files in which the dictionaries are nested in the third layer, as presented above?

Thank you in advance for all replies.

Answer 1

Given that you have a dict with a nested list:
1. create a dataframe from the overall structure
2. explode() the embedded list
3. expand the nested dict with apply(pd.Series)

import pandas as pd

d = {'groups': {'data': [{'group': 'Math', 'year_joined': '2009'},
   {'group_name': 'History', 'year_joined': '2011'},
   {'group_name': 'Biology', 'year_joined': '2010'}]},
 'id': '12512'}

# Normalize, put each list element on its own row, then expand each dict into columns
pd.json_normalize(d).explode("groups.data").reset_index(drop=True).pipe(
    lambda df: df["id"].to_frame().join(df["groups.data"].apply(pd.Series))
)

      id group year_joined group_name
0  12512  Math        2009        NaN
1  12512   NaN        2011    History
2  12512   NaN        2010    Biology
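
The output above still splits the label across "group" and "group_name". As a minimal sketch, assuming the two keys never both hold a value in the same entry (true for the sample in the question, but an assumption for the real file), the columns can be merged with combine_first. The variable names exploded, expanded and result are illustrative only.

```python
import pandas as pd

# Sample record copied from the question; the first entry uses "group",
# the other two use "group_name".
d = {
    "groups": {
        "data": [
            {"group": "Math", "year_joined": "2009"},
            {"group_name": "History", "year_joined": "2011"},
            {"group_name": "Biology", "year_joined": "2010"},
        ]
    },
    "id": "12512",
}

# Same first steps as the answer: normalize, then one row per list element.
exploded = pd.json_normalize(d).explode("groups.data").reset_index(drop=True)

# Expand each dict into its own columns (group, year_joined, group_name).
expanded = exploded["groups.data"].apply(pd.Series)

# Assumption: "group" and "group_name" are mutually exclusive per row,
# so taking the first non-null value of the two loses nothing.
expanded["group"] = expanded["group"].combine_first(expanded["group_name"])

result = exploded[["id"]].join(expanded[["group", "year_joined"]])
print(result)
```

This yields the three columns asked for in solution 1 of the question: id, group and year_joined.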

Answer 2

You need to collect the information from the data dictionary.

Solution 1

d = {}
for group in data["data"]:
    groups = [x["group_name"] for x in group['groups']["data"]]
    d['id'] = d.get('id', []) + [group['id']] * len(groups)
    d['group'] = d.get('group', []) + groups
    d['year_joined'] = d.get('year_joined', []) + [x["year_joined"] for x in group['groups']["data"]]

df = pd.DataFrame(d)

Output

      id      group year_joined
0  12512       Math        2009
1  12512    History        2011
2  12512    Biology        2010
3  23172  Chemistry        2007
4  23172  Economics        2008

Solution 2

d = {}
for group in data["data"]:
    d['id'] = d.get('id', []) + [group['id']]
    d['group'] = d.get('group', []) + [','.join(x["group_name"] for x in group['groups']["data"])]
    d['year_joined'] = d.get('year_joined', []) + [','.join(x["year_joined"] for x in group['groups']["data"])]

df = pd.DataFrame(d)

Output

      id                 group     year_joined
0  12512  Math,History,Biology  2009,2011,2010
1  23172   Chemistry,Economics       2007,2008
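
Both loops above index x["group_name"] directly, which would raise a KeyError on the sample in the question, where the first entry uses "group" instead. The following is a rough sketch of the same idea that accepts either key and builds a list of rows in one pass rather than concatenating lists on every iteration. The helper group_label is hypothetical, and the second record is reconstructed from the output shown above, not from the original file.

```python
import pandas as pd

# Sample shaped like the question's file. The second record's values are
# taken from the output printed in this answer (an assumption about the
# real data, which is not shown in full).
data = {
    "data": [
        {
            "groups": {
                "data": [
                    {"group": "Math", "year_joined": "2009"},
                    {"group_name": "History", "year_joined": "2011"},
                    {"group_name": "Biology", "year_joined": "2010"},
                ]
            },
            "id": "12512",
        },
        {
            "groups": {
                "data": [
                    {"group_name": "Chemistry", "year_joined": "2007"},
                    {"group_name": "Economics", "year_joined": "2008"},
                ]
            },
            "id": "23172",
        },
    ]
}


def group_label(entry):
    # Hypothetical helper: accept either spelling of the key.
    return entry.get("group_name", entry.get("group"))


# Solution 1: one row per (id, group) pair.
rows_long = [
    {"id": rec["id"], "group": group_label(e), "year_joined": e["year_joined"]}
    for rec in data["data"]
    for e in rec["groups"]["data"]
]
df_long = pd.DataFrame(rows_long)

# Solution 2: one row per id, values joined with commas.
rows_wide = [
    {
        "id": rec["id"],
        "group": ",".join(group_label(e) for e in rec["groups"]["data"]),
        "year_joined": ",".join(e["year_joined"] for e in rec["groups"]["data"]),
    }
    for rec in data["data"]
]
df_wide = pd.DataFrame(rows_wide)

print(df_long)
print(df_wide)
```

Collecting plain dicts and constructing the DataFrame once at the end avoids rebuilding the accumulated lists on each iteration, which may help with the slowness on larger files mentioned in the question.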

Answer 3

This seems to work for your example:

import pandas

data = [ # Original data from question
   {
      "groups": {
         "data": [
               {
               "group": "Math",
               "year_joined": "2009"
               },
               {
               "group_name": "History",
               "year_joined": "2011"
               },
               {
               "group_name": "Biology",
               "year_joined": "2010"
               }
         ]
      },
      "id": "12512"
   },
]
# Use the record_path to extract the list we are interested in, and make sure we retain ID
df = pandas.json_normalize(data, record_path=['groups','data'], meta=['id'])
# Combine the group and group_name columns into a single column as they appear mutually exclusive
df["group"] = df["group_name"].fillna(df["group"])
# Discard the now unnecessary column
df.drop(columns='group_name', inplace=True)

It gives:

  year_joined    group     id
0        2009     Math  12512
1        2011  History  12512
2        2010  Biology  12512

To create the second dataframe:

df.groupby(['id']).agg({'year_joined':list,'group':list})

                    year_joined                           group
id
12512  ['2009', '2011', '2010']  ['Math', 'History', 'Biology']
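
The groupby above produces Python lists, whereas solution 2 in the question asks for comma-separated strings. A possible variation, sketched here under the assumption that every value is a string (as in the sample), passes ",".join as the aggregation function. The name combined is illustrative only.

```python
import pandas as pd

# Sample record copied from the question.
data = [
    {
        "groups": {
            "data": [
                {"group": "Math", "year_joined": "2009"},
                {"group_name": "History", "year_joined": "2011"},
                {"group_name": "Biology", "year_joined": "2010"},
            ]
        },
        "id": "12512",
    },
]

# Same steps as the answer: flatten the nested list and keep the id.
df = pd.json_normalize(data, record_path=["groups", "data"], meta=["id"])
df["group"] = df["group_name"].fillna(df["group"])
df = df.drop(columns="group_name")

# Join the values of each id into one comma-separated string.
# Assumption: all values are strings; str.join fails on NaN or numbers.
combined = df.groupby("id", as_index=False).agg(
    group=("group", ",".join),
    year_joined=("year_joined", ",".join),
)
print(combined)
```

If the real file holds missing or numeric values in these columns, they would need to be converted with astype(str) or filtered out before joining.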

3 answers

Answer 1
2 2021-07-20 13:06:40

Answer 2
1 Accepted 2021-07-20 12:59:20

Answer 3
0 2021-07-20 13:04:20
