Converting complex nested json to csv via pandas
I have the following JSON file:
{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}
Note: the JSON above is a fragment of the original. The actual file contains many more attributes after 'margins', some of them nested and others not. I have included only a few for brevity and to give an idea of what I expect.
My goal is to flatten the data and load it into CSV. Here is the code I have written so far:
import json
import pandas as pd

path = r"/Users/samt/Downloads/test_data.json"
with open(path) as f:
    t_data = {}
    data = json.load(f)
    for team in data['matches']:
        if team['margins']:
            for idx, margin in enumerate(team['margins']):
                t_data['team'] = team['team']
                t_data['overallResult'] = team['overallResult']
                t_data['totalMatches'] = team['totalMatches']
                t_data['margin'] = margin.get('bar')
        else:
            t_data['team'] = team['team']
            t_data['overallResult'] = team['overallResult']
            t_data['totalMatches'] = team['totalMatches']
            t_data['margin'] = margin.get('bar')
    df = pd.DataFrame.from_dict(t_data, orient='index')
    print(df)
I know that the data is getting overwritten and that the loop is not properly structured. I am a bit new to dealing with JSON objects in Python, and I don't understand how to concatenate the results.
My goal is, once all the results are appended, to use to_csv and write them out as rows. For each margin, the entire record should be replicated as a separate row. Here is what I expect the output to be. Can someone please help with how to do this?
From what I can find on the net, the approach is to gather the dictionary items first, but how to transpose them into rows is something I don't understand. Also, is there a better way to parse the JSON than looping twice for one attribute, i.e. margins?
I can't use json_normalize, as that library is not supported in our environment.
[output data]
You can use pd.DataFrame to create a DataFrame and explode the margins column:
import json
import pandas as pd
with open('data.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())
df = pd.DataFrame(data['matches']).explode('margins', ignore_index=True)
print(df)
                  team  overallResult  totalMatches       margins
0  Sunrisers Hyderabad            Won             3  {'bar': 290}
1  Sunrisers Hyderabad            Won             3   {'bar': 90}
2        Pune Warriors           None             0          None
Then replace the None values in the margins column with a placeholder dictionary and expand the dictionaries into columns:
bar = df['margins'].apply(lambda x: x if x else {'bar': pd.NA}).apply(pd.Series)
print(bar)
    bar
0   290
1    90
2  <NA>
Finally, join the result to the original DataFrame and drop the margins column:
df = df.join(bar).drop(columns='margins')
print(df)
                  team  overallResult  totalMatches   bar
0  Sunrisers Hyderabad            Won             3   290
1  Sunrisers Hyderabad            Won             3    90
2        Pune Warriors           None             0  <NA>
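Since the end goal is a CSV file, the three steps above can be combined and finished with to_csv. The following is only a minimal sketch under some assumptions: pandas 1.1 or later (for explode with ignore_index=True), every margin being a dict with an integer 'bar' key, and an inline copy of the sample JSON in place of a real file. The names raw, bars and csv_text are made up for illustration.

```python
import json

import pandas as pd

# Inline copy of the sample from the question
# (assumption: the real data has the same shape).
raw = '''
{"matches": [
  {"team": "Sunrisers Hyderabad", "overallResult": "Won", "totalMatches": 3,
   "margins": [{"bar": 290}, {"bar": 90}]},
  {"team": "Pune Warriors", "overallResult": "None", "totalMatches": 0,
   "margins": null}
 ],
 "totalMatches": 70}
'''

data = json.loads(raw)

# One row per margin; a team whose margins is null keeps a single row.
df = pd.DataFrame(data["matches"]).explode("margins", ignore_index=True)

# Pull 'bar' out of each margin dict; anything that is not a dict becomes missing.
bars = [m["bar"] if isinstance(m, dict) else None for m in df["margins"]]

# A nullable integer dtype keeps the values integral next to a missing value
# (assumption: 'bar' is always an integer when present).
df["bar"] = pd.array(bars, dtype="Int64")
df = df.drop(columns="margins")

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write a file instead, pass a path to to_csv, for example df.to_csv('matches.csv', index=False), where the file name is just a placeholder.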
Using the json and csv modules: create a dictionary for each team, or one for each margin if the team has any.
import json, csv
s = '''{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}'''
j = json.loads(s)
matches = j['matches']
rows = []
for thing in matches:
    # print(thing)
    if not thing['margins']:
        rows.append(thing)
    else:
        for bar in (b['bar'] for b in thing['margins']):
            d = dict((k,thing[k]) for k in ('team','overallResult','totalMatches'))
            d['margins'] = bar
            rows.append(d)
# for row in rows: print(row)
# using an in-memory stream for this example instead of an actual file
import io
f = io.StringIO(newline='')
fieldnames=('team','overallResult','totalMatches','margins')
writer = csv.DictWriter(f,fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
f.seek(0)
print(f.read())
team,overallResult,totalMatches,margins
Sunrisers Hyderabad,Won,3,290
Sunrisers Hyderabad,Won,3,90
Pune Warriors,None,0,
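The question mentions that the real file has many more attributes than the sample. With a hard-coded fieldnames tuple, csv.DictWriter raises ValueError by default when a row contains a key that is not listed. A possible variation, sketched here with only the standard library, copies every attribute except the list being expanded and derives the header from the rows. The helper name flatten_matches and its list_key parameter are hypothetical, and the sketch assumes each margin is a flat dict. Other nested attributes would still need their own handling.

```python
import csv
import io
import json

# Inline copy of the sample from the question.
raw = '''
{"matches": [
  {"team": "Sunrisers Hyderabad", "overallResult": "Won", "totalMatches": 3,
   "margins": [{"bar": 290}, {"bar": 90}]},
  {"team": "Pune Warriors", "overallResult": "None", "totalMatches": 0,
   "margins": null}
 ],
 "totalMatches": 70}
'''


def flatten_matches(matches, list_key="margins"):
    """Yield one flat dict per entry of match[list_key], or one per match if it is empty."""
    for match in matches:
        # Everything except the list being expanded is repeated on each row.
        base = {k: v for k, v in match.items() if k != list_key}
        # null, missing or empty list -> a single row with no margin columns.
        items = match.get(list_key) or [{}]
        for item in items:
            yield {**base, **item}


rows = list(flatten_matches(json.loads(raw)["matches"]))

# Header = every key seen in any row, in first-seen order.
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

# In-memory stream, as in the answer above; swap in open(path, 'w', newline='') for a file.
buf = io.StringIO(newline="")
writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

Rows that lack a column (here, the team without margins) get the restval placeholder, an empty string.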
Getting multiple item values from a dictionary can be made easier with operator.itemgetter():
>>> import operator
>>> items = operator.itemgetter(*('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter('team','overallResult','totalMatches')
>>> #stuff = ('team','overallResult','totalMatches')
>>> #items = operator.itemgetter(*stuff)
>>> d = {'margins': 90,
... 'overallResult': 'Won',
... 'team': 'Sunrisers Hyderabad',
... 'totalMatches': 3}
>>> items(d)
('Sunrisers Hyderabad', 'Won', 3)
>>>
I like to use it and give the callable a descriptive name, but I don't see it used much here on SO.
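As a rough illustration of how the named getter could slot into the row-building loop, here is a small sketch. The names KEYS, team_fields and rows_for are invented for this example, and it assumes each margin is a dict with a 'bar' key.

```python
import operator

KEYS = ("team", "overallResult", "totalMatches")
team_fields = operator.itemgetter(*KEYS)  # descriptive name for the callable


def rows_for(match):
    """Return one flat dict per margin, or a single dict with margins=None."""
    base = dict(zip(KEYS, team_fields(match)))
    # A null or empty margins list still produces one row.
    margins = match["margins"] or [{"bar": None}]
    return [{**base, "margins": m["bar"]} for m in margins]


match = {
    "team": "Sunrisers Hyderabad",
    "overallResult": "Won",
    "totalMatches": 3,
    "margins": [{"bar": 290}, {"bar": 90}],
}

for row in rows_for(match):
    print(row)
```

The resulting dicts have the same keys as the fieldnames tuple used with csv.DictWriter above, so they can be passed to writerows unchanged.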