Python & Pandas: Flattening nested json with pd.json_normalize
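I'm new to Python and Pandas and trying to get the hang of JSON. Any help is appreciated.
I'm pulling a nested JSON through an API. Its structure is shown below. The fields I'm after are user_id and message under view, and then the sub-fields user_id and message under the nested field replies. The required fields are marked with <<< below.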
],
"view": [
{
"id": 109205,
"user_id": 6354, # <<<< this field
"parent_id": null,
"created_at": "2020-11-03T23:32:49Z",
"updated_at": "2020-11-03T23:32:49Z",
"rating_count": null,
"rating_sum": null,
"message": "message text1", # <<< this field
"replies": [
{
"id": 109298,
"user_id": 5457, # <<< this field
"parent_id": 109205,
"created_at": "2020-11-04T19:42:59Z",
"updated_at": "2020-11-04T19:42:59Z",
"rating_count": null,
"rating_sum": null,
"message": "message text2" # <<< this field
},
{
#json continues
I can successfully pull the top-level fields under view, but I'm having trouble flattening the nested replies field with json_normalize. Here is the code I'm working with:
import pandas as pd
d = r.json() # json pulled from API
df = pd.json_normalize(d['view'], record_path=['replies'])
print(df)
This results in the following KeyError:
Traceback (most recent call last):
File "C:\Users\danie\AppData\Local\Temp\atom_script_tempfiles\2021720-13268-1xuqx61.3oh2g", line 53, in <module>
df = pd.json_normalize(d['view'], record_path=['replies'])
File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 336, in _json_normalize
_recursive_extract(data, record_path, {}, level=0)
File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 309, in _recursive_extract
recs = _pull_records(obj, path[0])
File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 248, in _pull_records
result = _pull_field(js, spec)
File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 239, in _pull_field
result = result[spec]
KeyError: 'replies'
What am I missing here? All suggestions are welcome and appreciated.
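You are trying to flatten 2 different "depths" of the JSON, which cannot be done with record_path alone in a single json_normalize call. You can simply use 2 pd.json_normalize calls, since every entry contains an id that lets you match up the parsed data later: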
>>> pd.json_normalize(d, record_path='view')
id user_id parent_id created_at updated_at rating_count rating_sum message replies
0 109205 6354 None 2020-11-03T23:32:49Z 2020-11-03T23:32:49Z None None message text1 [{'id': 109298, 'user_id': 5457, 'parent_id': ...
>>> pd.json_normalize(d, record_path=['view', 'replies'])
id user_id parent_id created_at updated_at rating_count rating_sum message
0 109298 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text2
1 109299 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text3
(I added a second reply to your example, with the same data and the id incremented by 1, so we can see what happens when a view has multiple replies.)
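Both calls above assume that every entry under view has a replies key. If some entries lack it, record_path raises a KeyError much like the one in the question. Whether that is what happened with the original API response is an assumption, since the full JSON is not shown. A minimal sketch of one way to guard against it, using abbreviated sample data (the second view, id 109206, is hypothetical):

```python
import pandas as pd

# Abbreviated sample data; the second view (id 109206) is hypothetical
# and deliberately has no "replies" key.
d = {
    "view": [
        {
            "id": 109205,
            "user_id": 6354,
            "message": "message text1",
            "replies": [
                {"id": 109298, "user_id": 5457, "parent_id": 109205,
                 "message": "message text2"},
                {"id": 109299, "user_id": 5457, "parent_id": 109205,
                 "message": "message text3"},
            ],
        },
        {"id": 109206, "user_id": 6355, "message": "message text4"},
    ]
}

# With a missing key, record_path fails on the entry that lacks it.
try:
    pd.json_normalize(d, record_path=["view", "replies"])
    raised = False
except KeyError:
    raised = True

# Give every view an (possibly empty) list of replies before normalizing.
views = [{**v, "replies": v.get("replies") or []} for v in d["view"]]

view = pd.json_normalize(views)                            # one row per view
replies = pd.json_normalize(views, record_path="replies")  # one row per reply
print(raised, len(view), len(replies))
```

Views without replies simply contribute no rows to the replies frame.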
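Alternatively, you can run a second pd.json_normalize on the replies column of the previous result, which is probably less work. This gets more interesting if you first .explode() the column to get one row per reply: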
>>> view = pd.json_normalize(d, record_path='view')
>>> pd.json_normalize(view['replies'].explode())
id user_id parent_id created_at updated_at rating_count rating_sum message
0 109298 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text2
1 109299 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text3
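One caveat with this approach: .explode() turns an empty list into NaN, and json_normalize may not accept NaN entries mixed in with dicts. A hedged sketch that drops them first, again with abbreviated sample data (the view with id 109206 and its empty replies list are invented for illustration):

```python
import pandas as pd

# Abbreviated sample data; the second view (id 109206) is hypothetical
# and has an empty list of replies.
d = {
    "view": [
        {
            "id": 109205,
            "user_id": 6354,
            "message": "message text1",
            "replies": [
                {"id": 109298, "user_id": 5457, "parent_id": 109205,
                 "message": "message text2"},
                {"id": 109299, "user_id": 5457, "parent_id": 109205,
                 "message": "message text3"},
            ],
        },
        {"id": 109206, "user_id": 6355, "message": "message text4",
         "replies": []},
    ]
}

view = pd.json_normalize(d, record_path="view")

# One row per reply; the view with no replies becomes a single NaN row.
exploded = view["replies"].explode()

# Drop the NaN rows and pass a plain list of dicts to json_normalize.
replies = pd.json_normalize(exploded.dropna().tolist())
print(replies[["id", "parent_id", "message"]])
```

Passing a list rather than the Series also sidesteps any question of which index the result ends up with.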
So here is one way to build a single dataframe with all the information:
>>> view = pd.json_normalize(d, record_path='view')
>>> df = pd.merge(
... view.drop(columns=['replies']),
... pd.json_normalize(view['replies'].explode()),
... left_on='id', right_on='parent_id', how='right',
... suffixes=('_view', '_reply')
... )
>>> df
id_view user_id_view parent_id_view created_at_view updated_at_view rating_count_view rating_sum_view message_view id_reply user_id_reply parent_id_reply created_at_reply updated_at_reply rating_count_reply rating_sum_reply message_reply
0 109205 6354 None 2020-11-03T23:32:49Z 2020-11-03T23:32:49Z None None message text1 109298 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text2
1 109205 6354 None 2020-11-03T23:32:49Z 2020-11-03T23:32:49Z None None message text1 109299 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text3
>>> df[['user_id_view', 'message_view', 'user_id_reply', 'message_reply']]
user_id_view message_view user_id_reply message_reply
0 6354 message text1 5457 message text2
1 6354 message text1 5457 message text3
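If only a few parent fields are needed, the meta parameter of json_normalize might be a shorter route to the same four columns. The sketch below is an alternative to the merge above, not part of the original answer. The sample data is an abbreviated copy of the example, and the prefixes reply_ and view_ are arbitrary names chosen here to keep the duplicated column names apart:

```python
import pandas as pd

# Abbreviated copy of the example data (timestamps and ratings omitted).
d = {
    "view": [
        {
            "id": 109205,
            "user_id": 6354,
            "message": "message text1",
            "replies": [
                {"id": 109298, "user_id": 5457, "parent_id": 109205,
                 "message": "message text2"},
                {"id": 109299, "user_id": 5457, "parent_id": 109205,
                 "message": "message text3"},
            ],
        }
    ]
}

# record_path selects the nested list to turn into rows; meta copies the
# listed parent fields onto each of those rows.
flat = pd.json_normalize(
    d["view"],
    record_path="replies",
    meta=["user_id", "message"],
    record_prefix="reply_",  # arbitrary prefix for the reply columns
    meta_prefix="view_",     # arbitrary prefix for the parent columns
)

cols = ["view_user_id", "view_message", "reply_user_id", "reply_message"]
print(flat[cols])
```

Without the prefixes, pandas raises a ValueError about conflicting metadata names, because user_id and message exist at both levels.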