Converting complex nested JSON to CSV via pandas

I have the following JSON file:

{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}

Note: the JSON above is a fragment of the original. The actual file contains many more attributes after 'margins', some of them nested and others not. I included only a few for brevity and to give an idea of what I expect.

My goal is to flatten the data and load it into a CSV file. Here is the code I have written so far:

import json
import pandas as pd

path = r"/Users/samt/Downloads/test_data.json"

with open(path) as f:
    t_data = {}
    data = json.load(f)
    for team in data['matches']:
        if team['margins']:
            for idx, margin in enumerate(team['margins']):
                t_data['team'] = team['team']
                t_data['overallResult'] = team['overallResult']
                t_data['totalMatches'] = team['totalMatches']
                t_data['margin'] = margin.get('bar')
        else:
            t_data['team'] = team['team']
            t_data['overallResult'] = team['overallResult']
            t_data['totalMatches'] = team['totalMatches']
            t_data['margin'] = margin.get('bar')

    df = pd.DataFrame.from_dict(t_data, orient='index')
    print(df)            

I know that the data is getting overwritten and that the loop is not properly structured. I am a bit new to dealing with JSON objects in Python, and I can't work out how to concatenate the results.

My goal is, once all the results are appended, to use to_csv and write them out as rows. For each margin, the entire record is to be replicated as a separate row. Here is what I am expecting the output to be. Can someone please help me translate this into code?

From what I have found online, the approach is to gather the dictionary items first, but I don't understand how to turn them into rows. Also, is there a better way to parse the JSON than looping twice for one attribute, i.e. margins?

I can't use json_normalize, as that library is not supported in our environment.

[output data]

You can use pd.DataFrame to create a DataFrame and explode the margins column:

import json
import pandas as pd

with open('data.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())

df = pd.DataFrame(data['matches']).explode('margins', ignore_index=True)
print(df)

                  team overallResult  totalMatches       margins
0  Sunrisers Hyderabad           Won             3  {'bar': 290}
1  Sunrisers Hyderabad           Won             3   {'bar': 90}
2        Pune Warriors          None             0          None

Then replace the None values in the margins column with a dictionary and expand the column into its own DataFrame:

bar = df['margins'].apply(lambda x: x if x else {'bar': pd.NA}).apply(pd.Series)
print(bar)

    bar
0   290
1    90
2  <NA>

Finally, join the result to the original DataFrame and drop the margins column:

df = df.join(bar).drop(columns='margins')
print(df)

                  team overallResult  totalMatches   bar
0  Sunrisers Hyderabad           Won             3   290
1  Sunrisers Hyderabad           Won             3    90
2        Pune Warriors          None             0  <NA>
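
If explode is not an option, roughly the same rows can be built by repairing the loop from the question: create a fresh dictionary for every output row and append it to a list, instead of reusing one dictionary. This is only a sketch under some assumptions: the inline `raw` string stands in for the file, the column name `margin` follows the question's code, and the cast to the nullable `Int64` dtype is an optional choice that stops the missing margin from turning the column into floats.

```python
import io
import json

import pandas as pd

# Inline stand-in for the JSON file from the question.
raw = """{"matches": [
  {"team": "Sunrisers Hyderabad", "overallResult": "Won", "totalMatches": 3,
   "margins": [{"bar": 290}, {"bar": 90}]},
  {"team": "Pune Warriors", "overallResult": "None", "totalMatches": 0,
   "margins": null}
], "totalMatches": 70}"""

rows = []
for match in json.loads(raw)["matches"]:
    # A team without margins still yields one row, with an empty margin.
    for margin in match["margins"] or [{"bar": None}]:
        rows.append({
            "team": match["team"],
            "overallResult": match["overallResult"],
            "totalMatches": match["totalMatches"],
            "margin": margin.get("bar"),
        })

df = pd.DataFrame(rows)
# Nullable integer dtype keeps 290 from being written as 290.0.
df["margin"] = df["margin"].astype("Int64")

buffer = io.StringIO()  # pass a file path to to_csv instead to write to disk
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The `or [{"bar": None}]` fallback means a single inner loop covers both the "has margins" and "no margins" cases, so the shared fields are written out only once.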

Using the json and csv modules: create a dictionary for each team, one per margin if it has any.

import json, csv

s = '''{
    "matches": [
        {
            "team": "Sunrisers Hyderabad",
            "overallResult": "Won",
            "totalMatches": 3,
            "margins": [
                {
                    "bar": 290
                },
                {
                    "bar": 90
                }
            ]
        },
        {
            "team": "Pune Warriors",
            "overallResult": "None",
            "totalMatches": 0,
            "margins": null
        }
    ],
    "totalMatches": 70
}'''

j = json.loads(s)

matches = j['matches']
rows = []
for thing in matches:
    # print(thing)
    if not thing['margins']:
        rows.append(thing)
    else:
        for bar in (b['bar'] for b in thing['margins']):
            d = dict((k,thing[k]) for k in ('team','overallResult','totalMatches'))
            d['margins'] = bar
            rows.append(d)

# for row in rows: print(row)            

# using an in-memory stream for this example instead of an actual file
import io
f = io.StringIO(newline='')

fieldnames=('team','overallResult','totalMatches','margins')
writer = csv.DictWriter(f,fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
f.seek(0)
print(f.read())

team,overallResult,totalMatches,margins
Sunrisers Hyderabad,Won,3,290
Sunrisers Hyderabad,Won,3,90
Pune Warriors,None,0,
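
The question mentions that the real file has more attributes than the fragment. With the code above, a team without margins is appended as-is, so any key that is not listed in fieldnames would make csv.DictWriter raise a ValueError. One possible way around that, sketched here with a made-up extra key (`venue` is hypothetical and not from the original data), is to pass `extrasaction='ignore'`:

```python
import csv
import io

# Hypothetical rows: the first one carries an extra key ("venue")
# that is deliberately not listed in fieldnames.
rows = [
    {"team": "Pune Warriors", "overallResult": "None", "totalMatches": 0,
     "margins": None, "venue": "Pune"},
    {"team": "Sunrisers Hyderabad", "overallResult": "Won", "totalMatches": 3,
     "margins": 290},
]
fieldnames = ("team", "overallResult", "totalMatches", "margins")

out = io.StringIO(newline="")
# extrasaction="ignore" silently drops keys that are not in fieldnames;
# the default ("raise") would raise ValueError for the "venue" key.
writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
result = out.getvalue()
print(result)
```

If the extra attributes should end up in the CSV as well, they would need to be added to fieldnames instead of being ignored.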

Getting multiple item values from a dictionary can be made easier with operator.itemgetter():

>>> import operator
>>> items = operator.itemgetter(*('team','overallResult','totalMatches'))
>>> #items = operator.itemgetter('team','overallResult','totalMatches')
>>> #stuff = ('team','overallResult','totalMatches')
>>> #items = operator.itemgetter(*stuff)
>>> d = {'margins': 90,
...   'overallResult': 'Won',
...   'team': 'Sunrisers Hyderabad',
...   'totalMatches': 3}
>>> items(d)
('Sunrisers Hyderabad', 'Won', 3)
>>>

I like to use it and give the callable a descriptive name, but I don't see it used much here on SO.
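
As a rough illustration of how the getter could replace the dict(...) generator in the loop above, here is a small sketch. The names `COMMON_KEYS`, `get_common` and `flatten` are made up for this example, and it assumes every match has the three shared keys.

```python
import operator

COMMON_KEYS = ("team", "overallResult", "totalMatches")
# Descriptive name for the callable that pulls the shared fields.
get_common = operator.itemgetter(*COMMON_KEYS)


def flatten(match):
    """Return one row per margin, or a single row if there are none."""
    common = dict(zip(COMMON_KEYS, get_common(match)))
    # None or an empty list falls back to one placeholder margin.
    margins = match.get("margins") or [{"bar": None}]
    return [{**common, "margins": m["bar"]} for m in margins]


match = {
    "team": "Sunrisers Hyderabad",
    "overallResult": "Won",
    "totalMatches": 3,
    "margins": [{"bar": 290}, {"bar": 90}],
}
flat_rows = flatten(match)
print(flat_rows)
```

Because only the keys named in COMMON_KEYS are copied, extra attributes in the source dictionary never reach the rows handed to the CSV writer.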
