How to create a pandas DataFrame from nested JSON with a dictionary
I'm trying to create a pandas DataFrame from a JSON file. I've seen multiple solutions to this problem that use the built-in functions from_dict/json_normalize, yet I'm unable to apply them to my code. Here's how my data is structured in the JSON file:
"data": [
{
"groups": {
"data": [
{
"group": "Math",
"year_joined": "2009"
},
{
"group_name": "History",
"year_joined": "2011"
},
{
"group_name": "Biology",
"year_joined": "2010"
}
]
},
"id": "12512"
},
When I try to normalize this data with a pandas function like this:
import json
import pandas as pd

path = 'mypath'
# Use a context manager so the file is closed after loading
with open(path) as f:
    data = json.load(f)
test = pd.json_normalize(
    data['data'],
    errors='ignore')
I just get something like this:
id groups.data
0 12512 [{'group_name': 'Math', 'year_joined': '2009', 'gr...
1 23172 [{'group_name': 'Chemistry', 'year_joined': '2005'...
I want this data to look like this (solution 1):
id group year_joined
0 12512 group1 year1
1 12512 group2 year2
2 12512 group3 year3
Or like this (solution 2):
id group year_joined
0 12512 group1,group2,group3 year1,year2,year3
1 23172 group4,group5 year4,year5
How can I achieve this? I tried passing the 'record_path' parameter to the 'json_normalize' function, but it doesn't change anything. I tried to use the 'DataFrame.from_dict' function to work around this, but I failed. The only way I was able to get to solution 1 was to write multiple loops that iterate through everything in the JSON file and append it to separate lists. It kind of works, but it takes a lot of time on bigger datasets.
How can I use built-in pandas tools to process files where the records are nested as dictionaries in the third layer of the file, as shown above?
Thank you in advance for all replies.
Use explode() on the embedded list, then apply(pd.Series) to expand the nested dictionaries into columns:

import pandas as pd

d = {'groups': {'data': [{'group': 'Math', 'year_joined': '2009'},
                         {'group_name': 'History', 'year_joined': '2011'},
                         {'group_name': 'Biology', 'year_joined': '2010'}]},
     'id': '12512'}

# One row per list item, then one column per dictionary key
pd.json_normalize(d).explode("groups.data").reset_index(drop=True).pipe(
    lambda d: d["id"].to_frame().join(d["groups.data"].apply(pd.Series))
)
| | id | group | year_joined | group_name |
|---|---|---|---|---|
| 0 | 12512 | Math | 2009 | nan |
| 1 | 12512 | nan | 2011 | History |
| 2 | 12512 | nan | 2010 | Biology |
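The result above still has the names split across the group and group_name columns. Here is a minimal sketch of one way to merge them after the explode step. It assumes the two keys never appear in the same record, as in the sample. It also builds the columns with pd.DataFrame on the list of dictionaries instead of apply(pd.Series); that is a substitution of mine, not part of the answer above.

```python
import pandas as pd

d = {'groups': {'data': [{'group': 'Math', 'year_joined': '2009'},
                         {'group_name': 'History', 'year_joined': '2011'},
                         {'group_name': 'Biology', 'year_joined': '2010'}]},
     'id': '12512'}

# One row per item of the embedded list
exploded = pd.json_normalize(d).explode("groups.data").reset_index(drop=True)

# Expand the dictionaries into columns in a single constructor call
expanded = pd.DataFrame(exploded["groups.data"].tolist())

# Assumption: 'group' and 'group_name' are mutually exclusive per record
expanded["group"] = expanded["group"].fillna(expanded["group_name"])

result = exploded[["id"]].join(expanded[["group", "year_joined"]])
print(result)
```

The final frame has the three columns id, group and year_joined, which is the layout asked for as solution 1.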
You need to collect the information from the data dictionary.
Solution 1
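import pandas as pd

# Note: this assumes every entry has a "group_name" key; the first entry in
# the question's sample uses "group" instead and would raise a KeyError.
d = {}
for group in data["data"]:
    groups = [x["group_name"] for x in group['groups']["data"]]
    # Repeat the id once per group so all columns have the same length
    d['id'] = d.get('id', []) + [group['id']] * len(groups)
    d['group'] = d.get('group', []) + groups
    d['year_joined'] = d.get('year_joined', []) + [x["year_joined"] for x in group['groups']["data"]]
df = pd.DataFrame(d)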
Output
id group year_joined
0 12512 Math 2009
1 12512 History 2011
2 12512 Biology 2010
3 23172 Chemistry 2007
4 23172 Economics 2008
Solution 2
import pandas as pd

# Same "group_name" assumption as in solution 1
d = {}
for group in data["data"]:
    d['id'] = d.get('id', []) + [group['id']]
    # Join all groups (and years) of one id into a single comma-separated string
    d['group'] = d.get('group', []) + [','.join(x["group_name"] for x in group['groups']["data"])]
    d['year_joined'] = d.get('year_joined', []) + [','.join(x["year_joined"] for x in group['groups']["data"])]
df = pd.DataFrame(d)
Output
id group year_joined
0 12512 Math,History,Biology 2009,2011,2010
1 23172 Chemistry,Economics 2007,2008
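The loops above rebuild each list on every iteration with d.get(...) + [...], which may become slow on larger files, and they expect a group_name key in every entry. As a rough sketch of an alternative, the rows for solution 1 can be collected in a single pass and handed to the DataFrame constructor once. The second record in the sample below is reconstructed from the printed output and is an assumption, as is the fallback from group_name to group.

```python
import pandas as pd

# Sample shaped like the question's file; the second record is reconstructed
# from the printed output above and is an assumption.
data = {"data": [
    {"groups": {"data": [
        {"group": "Math", "year_joined": "2009"},
        {"group_name": "History", "year_joined": "2011"},
        {"group_name": "Biology", "year_joined": "2010"}]},
     "id": "12512"},
    {"groups": {"data": [
        {"group_name": "Chemistry", "year_joined": "2007"},
        {"group_name": "Economics", "year_joined": "2008"}]},
     "id": "23172"},
]}

# One row per (record, group) pair, built in a single pass
rows = [
    {"id": rec["id"],
     # Fall back to "group" when "group_name" is missing (assumed equivalent)
     "group": g.get("group_name", g.get("group")),
     "year_joined": g["year_joined"]}
    for rec in data["data"]
    for g in rec["groups"]["data"]
]
df = pd.DataFrame(rows)
print(df)
```

Building a list of row dictionaries and constructing the frame once avoids copying the growing lists on each iteration.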
This seems to work for your example:
import pandas

data = [  # Original data from question
    {
        "groups": {
            "data": [
                {
                    "group": "Math",
                    "year_joined": "2009"
                },
                {
                    "group_name": "History",
                    "year_joined": "2011"
                },
                {
                    "group_name": "Biology",
                    "year_joined": "2010"
                }
            ]
        },
        "id": "12512"
    },
]
# Use the record_path to extract the list we are interested in, and make sure we retain ID
df = pandas.json_normalize(data, record_path=['groups', 'data'], meta=['id'])
# Combine the group and group_name columns into a single column as they appear mutually exclusive
df["group"] = df["group_name"].fillna(df["group"])
# Discard the now unnecessary column
df.drop(columns='group_name', inplace=True)
It gives:
| | year_joined | group | id |
|---|---|---|---|
| 0 | 2009 | Math | 12512 |
| 1 | 2011 | History | 12512 |
| 2 | 2010 | Biology | 12512 |
To create the second dataframe:
df.groupby(['id']).agg({'year_joined':list,'group':list})
| id | year_joined | group |
|---|---|---|
| 12512 | ['2009', '2011', '2010'] | ['Math', 'History', 'Biology'] |
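The aggregation above produces Python lists in each cell. If the comma-separated strings from the question's solution 2 are wanted instead, one possible sketch is to pass ','.join as the aggregator. The frame is rebuilt by hand here only so that the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the normalized frame shown above
df = pd.DataFrame({
    "year_joined": ["2009", "2011", "2010"],
    "group": ["Math", "History", "Biology"],
    "id": ["12512", "12512", "12512"],
})

# Join the string values of each id into one comma-separated cell
joined = (
    df.groupby("id", as_index=False)
      .agg({"group": ",".join, "year_joined": ",".join})
)
print(joined)
```

This relies on both columns holding strings, as they do in the sample; numeric years would need to be converted with astype(str) first.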