
Convert unstructured JSON to structured DataFrame

I am trying to read this GitHub JSON file (URL below), which contains information about football teams, matches, and players.

This is my sample code:

import json
import pandas as pd
import urllib.request
from pandas import json_normalize

load_path = 'https://raw.githubusercontent.com/henriquepgomide/caRtola/master/data/2021/Mercado_10.txt'
games_2021 = json.loads(urllib.request.urlopen(load_path).read().decode('latin-1'))
games_2021 = json_normalize(games_2021)
games_2021

Bad output:

[image: screenshot of the output]

The desired output can be seen by running the code below:

pd.read_csv('https://raw.githubusercontent.com/henriquepgomide/caRtola/master/data/2022/rodada-0.csv')

[image: screenshot of the desired output]

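Both URLs contain the same information, but the JSON file is, I guess, laid out as a dictionary, where the initial sections map some of the column values that players and teams can have, while the other link has already been cleaned up somehow into a CSV structure.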
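To illustrate the layout described above, here is a minimal sketch on a made-up miniature of the JSON. The key names `atletas` and `clubes` follow the real file, but every value is invented. Passing the whole top-level dictionary to `json_normalize` treats it as a single record, which is probably why the first attempt comes out as one wide row:

```python
import pandas as pd

# Hypothetical miniature of the file: a list of players plus a lookup
# dictionary keyed by id. All values are invented for illustration.
sample = {
    "atletas": [
        {"atleta_id": 1, "clube_id": 262},
        {"atleta_id": 2, "clube_id": 262},
    ],
    "clubes": {"262": {"id": 262, "nome": "Club X"}},
}

# The top-level dict is one record, so the result has a single row:
# nested dicts become dotted columns and the player list stays in one cell.
wide = pd.json_normalize(sample)
print(wide.shape)
print(list(wide.columns))
```

The player records only become rows once the list under `atletas` itself is passed in.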

Just normalize the 'atletas' key of the JSON, or simply build a DataFrame from it directly.

import requests
import pandas as pd

load_path = 'https://raw.githubusercontent.com/henriquepgomide/caRtola/master/data/2021/Mercado_10.txt'
jsonData = requests.get(load_path).json()

# Flatten only the list of players stored under the 'atletas' key
games_2021 = pd.json_normalize(jsonData['atletas'])

# Keep every column except the flattened 'scout.*' statistics
cols = [x for x in games_2021.columns if 'scout.' not in x]
games_2021 = games_2021[cols]

Or:

import requests
import pandas as pd

load_path = 'https://raw.githubusercontent.com/henriquepgomide/caRtola/master/data/2021/Mercado_10.txt'
jsonData = requests.get(load_path).json()

# Build the frame directly; 'scout' stays a single column of dicts, so drop it
games_2021 = pd.DataFrame(jsonData['atletas']).drop('scout', axis=1)

Output:

print(games_2021)
     atleta_id  ...                                               foto
0        83817  ...  https://s.glbimg.com/es/sde/f/2021/06/04/68300...
1        95799  ...  https://s.glbimg.com/es/sde/f/2020/07/28/e1784...
2        81798  ...  https://s.glbimg.com/es/sde/f/2021/04/19/7d895...
3        68808  ...  https://s.glbimg.com/es/sde/f/2021/04/19/ca9f7...
4        92496  ...  https://s.glbimg.com/es/sde/f/2020/08/28/8c0a6...
..         ...  ...                                                ...
755      50645  ...  https://s.glbimg.com/es/sde/f/2021/06/04/fae6b...
756      69345  ...  https://s.glbimg.com/es/sde/f/2021/05/01/0f714...
757     110465  ...  https://s.glbimg.com/es/sde/f/2021/04/26/a2187...
758     111578  ...  https://s.glbimg.com/es/sde/f/2021/04/27/21a13...
759      38315  ...  https://s.glbimg.com/es/sde/f/2020/10/09/a19dc...

[760 rows x 15 columns]
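The two approaches above can be compared offline on a couple of invented records. This is only a sketch: the field names `atleta_id`, `clube_id` and `scout` come from the thread, while `apelido` and all of the values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical records shaped like the entries under 'atletas'.
atletas_sample = [
    {"atleta_id": 1, "apelido": "Player A", "clube_id": 262, "scout": {"G": 2, "A": 1}},
    {"atleta_id": 2, "apelido": "Player B", "clube_id": 263, "scout": {"G": 1}},
]

# json_normalize expands the nested 'scout' dict into 'scout.G', 'scout.A', ...
flat = pd.json_normalize(atletas_sample)
without_scout = flat[[c for c in flat.columns if not c.startswith("scout.")]]

# The plain constructor keeps 'scout' as one column of dicts, removed by drop().
direct = pd.DataFrame(atletas_sample).drop(columns="scout")

print(list(flat.columns))
print(list(without_scout.columns) == list(direct.columns))
```

Use `json_normalize` if the per-player scout statistics are wanted as separate columns. If they are going to be discarded anyway, the plain constructor avoids creating them in the first place.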

Then just read each of the tables and merge them to get the full dataset:

import requests
import pandas as pd

load_path = 'https://raw.githubusercontent.com/henriquepgomide/caRtola/master/data/2021/Mercado_10.txt'
jsonData = requests.get(load_path).json()

# Players are a list of records; the other tables are dicts keyed by id
atletas = pd.DataFrame(jsonData['atletas']).drop('scout', axis=1)
clubes = pd.DataFrame(list(jsonData['clubes'].values()))
posicoes = pd.DataFrame(list(jsonData['posicoes'].values()))
status = pd.DataFrame(list(jsonData['status'].values()))

# Left-join each lookup table onto the players; suffixes separate clashing column names
df = atletas.merge(clubes, how='left', left_on='clube_id', right_on='id', suffixes=('', '_clube'))
df = df.merge(posicoes, how='left', left_on='posicao_id', right_on='id', suffixes=('', '_posicao'))
df = df.merge(status, how='left', left_on='status_id', right_on='id', suffixes=('', '_status'))
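The merge step can be tried without a network connection on a small invented sample. The sketch below is a variation on the code above, not the answer's exact method: it prefixes the lookup columns with `add_prefix` so that the join keys share a name, and it passes `validate="many_to_one"` so pandas raises an error if a lookup table contains duplicate ids. The club and position names are made up.

```python
import pandas as pd

# Hypothetical miniature of the JSON layout (all values invented).
sample = {
    "atletas": [
        {"atleta_id": 1, "nome": "Player A", "clube_id": 262, "posicao_id": 1},
        {"atleta_id": 2, "nome": "Player B", "clube_id": 999, "posicao_id": 2},
    ],
    "clubes": {"262": {"id": 262, "nome": "Club X"}},
    "posicoes": {
        "1": {"id": 1, "nome": "Goalkeeper"},
        "2": {"id": 2, "nome": "Defender"},
    },
}

atletas = pd.DataFrame(sample["atletas"])
# Prefixing turns 'id'/'nome' into 'clube_id'/'clube_nome' and so on,
# so no suffixes are needed to separate clashing column names.
clubes = pd.DataFrame(list(sample["clubes"].values())).add_prefix("clube_")
posicoes = pd.DataFrame(list(sample["posicoes"].values())).add_prefix("posicao_")

df = (
    atletas
    .merge(clubes, how="left", on="clube_id", validate="many_to_one")
    .merge(posicoes, how="left", on="posicao_id", validate="many_to_one")
)
print(df)
```

Because these are left joins, a player whose `clube_id` has no match in the lookup table (Player B here) is kept, with a missing value in `clube_nome`.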

