Python & Pandas: Flattening nested json with pd.json_normalize

New to Python and Pandas, and working on getting the hang of jsons. Any help is appreciated.

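Via an API I'm pulling in a nested json. The structure of the json is below. The fields I'm after are in view, labeled user_id and message, and then, under the nested field replies, the subfields user_id and message. The fields I need are marked with <<< below.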

  ],
  "view": [
    {
      "id": 109205,
      "user_id": 6354, # <<<< this field
      "parent_id": null,
      "created_at": "2020-11-03T23:32:49Z",
      "updated_at": "2020-11-03T23:32:49Z",
      "rating_count": null,
      "rating_sum": null,
      "message": "message text1", # <<< this field
      "replies": [
        {
          "id": 109298,
          "user_id": 5457, # <<< this field
          "parent_id": 109205,
          "created_at": "2020-11-04T19:42:59Z",
          "updated_at": "2020-11-04T19:42:59Z",
          "rating_count": null,
          "rating_sum": null,
          "message": "message text2" # <<< this field
        },
        {
         #json continues

I can successfully pull the top-level fields under view, but I'm having trouble flattening the nested json field replies with json_normalize. Here is my working code:

import pandas as pd

d = r.json() # json pulled from API

df = pd.json_normalize(d['view'], record_path=['replies'])

print(df)

This results in the following KeyError:

Traceback (most recent call last):
  File "C:\Users\danie\AppData\Local\Temp\atom_script_tempfiles\2021720-13268-1xuqx61.3oh2g", line 53, in <module>
    df = pd.json_normalize(d['view'], record_path=['replies'])
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 336, in _json_normalize
    _recursive_extract(data, record_path, {}, level=0)
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 309, in _recursive_extract
    recs = _pull_records(obj, path[0])
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 248, in _pull_records
    result = _pull_field(js, spec)
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 239, in _pull_field
    result = result[spec]
KeyError: 'replies'

What am I missing here? Any and all advice is welcome and appreciated.

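You're trying to flatten 2 different "depths" in the json file, which can't be done in a single json_normalize call. You can simply use 2 pd.json_normalize calls, since every entry contains an id that lets you match up all the parsed data afterwards: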

>>> pd.json_normalize(d, record_path='view')
       id  user_id parent_id            created_at            updated_at rating_count rating_sum        message                                            replies
0  109205     6354      None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z         None       None  message text1  [{'id': 109298, 'user_id': 5457, 'parent_id': ...
>>> pd.json_normalize(d, record_path=['view', 'replies'])
       id  user_id  parent_id            created_at            updated_at rating_count rating_sum        message
0  109298     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text2
1  109299     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text3

(I've added the same data with the id incremented by 1 as a second reply to your example, so that we can see what happens with several replies per view.)
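To try the two calls above without the API, here is a minimal sketch. The dict d is hypothetical sample data shaped like the question's json (timestamps and rating fields omitted), and the second view without a replies key is an assumption. The full payload isn't shown, so a missing key is only one plausible cause of the KeyError in the question: record_path needs its key to be present on every record.

```python
import pandas as pd

# Hypothetical sample payload shaped like the question's json.
d = {
    "view": [
        {
            "id": 109205,
            "user_id": 6354,
            "parent_id": None,
            "message": "message text1",
            "replies": [
                {"id": 109298, "user_id": 5457, "parent_id": 109205, "message": "message text2"},
                {"id": 109299, "user_id": 5457, "parent_id": 109205, "message": "message text3"},
            ],
        },
        # Assumed case: a view that has no "replies" key at all.
        {"id": 109300, "user_id": 7001, "parent_id": None, "message": "message text4"},
    ]
}

# record_path needs the key on every record, so this call raises KeyError.
try:
    pd.json_normalize(d, record_path=["view", "replies"])
    raised = False
except KeyError:
    raised = True

# Give every view a (possibly empty) replies list before normalizing.
for v in d["view"]:
    v.setdefault("replies", [])

views = pd.json_normalize(d, record_path="view")
replies = pd.json_normalize(d, record_path=["view", "replies"])
print(views[["id", "user_id", "message"]])
print(replies[["id", "user_id", "parent_id", "message"]])
```

If your real data always carries a replies key, the setdefault loop is unnecessary and the two json_normalize calls are all you need.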

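Alternatively, you can call pd.json_normalize a second time on the replies column of the previous result, which is probably less work. This gets more interesting if you .explode() the column first, to get one row per reply: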

>>> view = pd.json_normalize(d, record_path='view')
>>> pd.json_normalize(view['replies'].explode())
       id  user_id  parent_id            created_at            updated_at rating_count rating_sum        message
0  109298     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text2
1  109299     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text3
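The explode step can be tried on its own with a small frame. This is only a sketch: the view frame below is hypothetical, and the dropna() and tolist() calls are my additions rather than part of the answer. They guard against views whose replies list is empty, which explode() turns into NaN.

```python
import pandas as pd

# Hypothetical views: one with two replies, one with an empty replies list.
view = pd.DataFrame(
    {
        "id": [109205, 109300],
        "replies": [
            [
                {"id": 109298, "parent_id": 109205, "message": "message text2"},
                {"id": 109299, "parent_id": 109205, "message": "message text3"},
            ],
            [],
        ],
    }
)

# One row per reply; the empty list becomes a single NaN row.
exploded = view["replies"].explode()

# Keep only the real reply dicts and hand json_normalize a plain list.
reply_dicts = exploded.dropna().tolist()
replies = pd.json_normalize(reply_dicts)
print(replies)
```

Passing a plain list of dicts also sidesteps any question of how the duplicated index left behind by explode() is treated.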

So here's one way to build a single dataframe with all the info:

>>> view = pd.json_normalize(d, record_path='view')
>>> df = pd.merge(
...     view.drop(columns=['replies']),
...     pd.json_normalize(view['replies'].explode()),
...     left_on='id', right_on='parent_id', how='right',
...     suffixes=('_view', '_reply')
... )
>>> df
   id_view  user_id_view parent_id_view       created_at_view       updated_at_view rating_count_view rating_sum_view   message_view  id_reply  user_id_reply  parent_id_reply      created_at_reply      updated_at_reply rating_count_reply rating_sum_reply  message_reply
0   109205          6354           None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z              None            None  message text1    109298           5457           109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z               None             None  message text2
1   109205          6354           None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z              None            None  message text1    109299           5457           109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z               None             None  message text3
>>> df[['user_id_view', 'message_view', 'user_id_reply', 'message_reply']]
   user_id_view   message_view  user_id_reply  message_reply
0          6354  message text1           5457  message text2
1          6354  message text1           5457  message text3
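For reference, the whole pipeline as one self-contained script. Again a sketch under assumptions: d is hypothetical sample data with most fields omitted, and the view with an empty replies list plus the dropna().tolist() guard are additions for illustration.

```python
import pandas as pd

# Hypothetical sample payload; the second view has no replies.
d = {
    "view": [
        {
            "id": 109205,
            "user_id": 6354,
            "parent_id": None,
            "message": "message text1",
            "replies": [
                {"id": 109298, "user_id": 5457, "parent_id": 109205, "message": "message text2"},
                {"id": 109299, "user_id": 5457, "parent_id": 109205, "message": "message text3"},
            ],
        },
        {"id": 109300, "user_id": 7001, "parent_id": None, "message": "message text4", "replies": []},
    ]
}

view = pd.json_normalize(d, record_path="view")

# One row per reply; drop the NaN produced by the empty replies list.
replies = pd.json_normalize(view["replies"].explode().dropna().tolist())

# how="right" keeps exactly one row per reply, joined to its parent view.
df = pd.merge(
    view.drop(columns=["replies"]),
    replies,
    left_on="id",
    right_on="parent_id",
    how="right",
    suffixes=("_view", "_reply"),
)

result = df[["user_id_view", "message_view", "user_id_reply", "message_reply"]]
print(result)
```

With how="right" the view that has no replies drops out of the result. Switching to how="left" would keep it, with the reply columns left empty.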

