
How to create DataFrame from json data - dicts, lists and arrays within an array

I can only get the headers, not the data, from the JSON data.

I tried to use json_normalize, which creates a DataFrame from JSON data, but when I loop over the races and append the results, I only get the headers.

import pandas as pd
import json
import requests
from pandas import json_normalize  # pandas >= 1.0; on older versions: from pandas.io.json import json_normalize
import numpy as np

# importing json data

def get_json(file_path):
    r = requests.get('https://www.atg.se/services/racinginfo/v1/api/games/V75_2019-09-29_5_6')
    jsonResponse = r.json()
    with open(file_path, 'w', encoding='utf-8') as outfile:
        json.dump(jsonResponse, outfile, ensure_ascii=False, indent=None)

# Run the function and choose where to save the json file
get_json('../trav.json')

# Open the json file and print a list of the keys
with open('../trav.json', 'r', encoding='utf-8') as json_data:
    d = json.load(json_data)

    print(list(d.keys()))

[Out]:
['@type', 'id', 'status', 'pools', 'races', 'currentVersion']

To get all the data for the starts in one race, I can use the json_normalize function:

race_1_starts = json_normalize(d['races'][0]['starts'])
race_1_starts_df = race_1_starts.drop('videos', axis=1)
print(race_1_starts_df)

[Out]:
    distance  driver.birth  ... result.prizeMoney  result.startNumber
0       1640          1984  ...             62500                   1
1       1640          1976  ...             11000                   2
2       1640          1968  ...               500                   3
3       1640          1953  ...            250000                   4
4       1640          1968  ...               500                   5
5       1640          1962  ...             18500                   6
6       1640          1961  ...              7000                   7
7       1640          1989  ...             31500                   8
8       1640          1960  ...               500                   9
9       1640          1954  ...               500                  10
10      1640          1977  ...            125000                  11
11      1640          1977  ...               500                  12

Above we get a DataFrame with data on all starts from one race. However, when I loop through all the races to get the starts for every race, I only get the headers for each race and not the data:


all_starts = []

for t in range(len(d['races'])):

    all_starts.append([t+1, json_normalize(d['races'][t]['starts'])])

all_starts_df = pd.DataFrame(all_starts, columns = ['race', 'starts'])
print(all_starts_df)

[Out]:
   race                                             starts
0     1      distance  ...                             ...
1     2      distance  ...                             ...
2     3      distance  ...                             ...
3     4      distance  ...                             ...
4     5      distance  ...                             ...
5     6      distance  ...                             ...
6     7      distance  ...                             ...
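
The truncated output is probably a display effect rather than missing data: each cell in the starts column holds a whole DataFrame, and printing the outer frame only shows the beginning of each nested frame's text representation. A minimal sketch with made-up data (the `toy_races` structure and its values are hypothetical, not the ATG response):

```python
import pandas as pd

# Hypothetical stand-in for d['races']; the values are invented for illustration
toy_races = [
    {"starts": [{"distance": 1640, "driver": {"birth": 1984}}]},
    {"starts": [{"distance": 2140, "driver": {"birth": 1968}}]},
]

nested = []
for t, race in enumerate(toy_races):
    # Same pattern as the loop above: a whole DataFrame is stored in one cell
    nested.append([t + 1, pd.json_normalize(race["starts"])])

nested_df = pd.DataFrame(nested, columns=["race", "starts"])

# Each cell of 'starts' is itself a DataFrame that still holds all its rows
first_cell = nested_df.loc[0, "starts"]
print(type(first_cell).__name__)
print(first_cell)
```

So the rows are still there, just nested one level too deep to be shown or used as ordinary columns.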

As output I want a single DataFrame that combines the data on all starts from all races. Note that the number of columns can differ between races. If one race has 21 columns and another has 20, all_starts_df should contain all the columns, and where a race has no data for a column the value should be NaN.

Expected result:

[Out]:
race  distance  driver.birth  ...  result.column_20  result.column_22
   1      1640          1984  ...             12500                 1
   1      1640          1976  ...             11000                 2
   2      2140          1968  ...               NaN                 1
   2      2140          1953  ...               NaN                 2
   3      3360          1968  ...              1500               NaN
   3      3360          1953  ...            250000               NaN

If you want all columns, you can try this. (I find a lot more than 20 columns, so I might have something wrong.)

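all_starts = []
headers = []
for idx, race in enumerate(d['races'], start=1):
    df = json_normalize(race['starts'])
    df['race'] = idx  # 1-based race number, matching the expected output
    # Drop 'videos' before recording the headers, so it is not re-added as NaN below;
    # errors='ignore' covers races that have no 'videos' column
    df = df.drop(columns='videos', errors='ignore')
    all_starts.append(df)
    headers.append(set(df.columns))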

# Create set of all columns for all races
columns = set.union(*headers)

# If columns are missing from one dataframe add it (as np.nan)
for df in all_starts:
    for c in columns - set(df.columns):
        df[c] = np.nan

# Concatenate all dataframes for each race to make one dataframe
df_all_starts = pd.concat(all_starts, axis=0, sort=True)
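
As a side note, `pd.concat` aligns frames on column names and by default keeps the union of columns (`join='outer'`), filling the gaps with NaN, so the manual step that adds missing columns may not be strictly needed. A small sketch with made-up frames (the column names and values here are hypothetical):

```python
import pandas as pd

# Two hypothetical per-race frames with partly different columns
race_a = pd.DataFrame({
    "race": [1, 1],
    "distance": [1640, 1640],
    "result.prizeMoney": [62500, 11000],
})
race_b = pd.DataFrame({
    "race": [2, 2],
    "distance": [2140, 2140],
    "result.startNumber": [1, 2],
})

# join='outer' is the default: the result has the union of all columns,
# and cells a frame has no data for become NaN
combined = pd.concat([race_a, race_b], axis=0, sort=True, ignore_index=True)
print(combined)
```

Passing `ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each race's own row numbers.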

Alternatively, if you know the names of the columns you want to keep, try this:

columns = ['race', 'distance', 'driver.birth', 'result.prizeMoney']
all_starts = []

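for idx, race in enumerate(d['races'], start=1):
    df = json_normalize(race['starts'])
    df['race'] = idx  # 1-based race number, matching the expected output
    # reindex keeps only the wanted columns and adds NaN for any a race lacks
    all_starts.append(df.reindex(columns=columns))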

# Concatenate all dataframes for each race to make one dataframe
df_all_starts = pd.concat(all_starts, axis=0)
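
Another option that may be worth trying is to let `json_normalize` do the looping through its `record_path` and `meta` parameters. The sketch below uses made-up data; in particular, the `number` key on each race is an assumption for illustration, so substitute whichever race-level field the real response actually provides.

```python
import pandas as pd

# Hypothetical stand-in for d['races']; 'number' is an assumed race-level key
toy_races = [
    {"number": 1, "starts": [
        {"distance": 1640, "driver": {"birth": 1984}, "result": {"prizeMoney": 62500}},
        {"distance": 1640, "driver": {"birth": 1976}, "result": {"prizeMoney": 11000}},
    ]},
    {"number": 2, "starts": [
        {"distance": 2140, "driver": {"birth": 1968}},  # no 'result' for this start
    ]},
]

# record_path selects the list to expand into rows; meta copies race-level
# fields onto every row that came from that race
flat = pd.json_normalize(toy_races, record_path="starts", meta=["number"])
flat = flat.rename(columns={"number": "race"})
print(flat)
```

Starts that lack a field end up with NaN in that column, which matches the behaviour asked for in the question.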
