Python & Pandas: Flattening nested JSON with pd.json_normalize

Question

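I'm new to Python and Pandas and am trying to get the hang of JSON. Any help is appreciated.

I'm pulling a nested JSON from an API; its structure is shown below. The fields I'm after are user_id and message under view, and then user_id and message in the nested replies sub-field. The required fields are marked with <<< below.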

  ],
  "view": [
    {
      "id": 109205,
      "user_id": 6354, # <<<< this field
      "parent_id": null,
      "created_at": "2020-11-03T23:32:49Z",
      "updated_at": "2020-11-03T23:32:49Z",
      "rating_count": null,
      "rating_sum": null,
      "message": "message text1", # <<< this field
      "replies": [
        {
          "id": 109298,
          "user_id": 5457, # <<< this field
          "parent_id": 109205,
          "created_at": "2020-11-04T19:42:59Z",
          "updated_at": "2020-11-04T19:42:59Z",
          "rating_count": null,
          "rating_sum": null,
          "message": "message text2" # <<< this field
        },
        {
         #json continues

I can successfully pull the top-level fields under view, but I'm having trouble flattening the nested replies field with json_normalize. Here is my code so far:

import pandas as pd

d = r.json() # json pulled from API

df = pd.json_normalize(d['view'], record_path=['replies'])

print(df)

This results in the following KeyError:

Traceback (most recent call last):
  File "C:\Users\danie\AppData\Local\Temp\atom_script_tempfiles\2021720-13268-1xuqx61.3oh2g", line 53, in <module>
    df = pd.json_normalize(d['view'], record_path=['replies'])
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 336, in _json_normalize
    _recursive_extract(data, record_path, {}, level=0)
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 309, in _recursive_extract
    recs = _pull_records(obj, path[0])
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 248, in _pull_records
    result = _pull_field(js, spec)
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 239, in _pull_field
    result = result[spec]
KeyError: 'replies'

What am I missing here? All suggestions are welcome and appreciated.

Answer 1

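You are trying to flatten 2 different "depths" of the JSON, which a single json_normalize call using only record_path cannot do. You can simply use 2 pd.json_normalize calls, since every entry contains an id that lets you match up all the parsed data afterwards: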

>>> pd.json_normalize(d, record_path='view')
       id  user_id parent_id            created_at            updated_at rating_count rating_sum        message                                            replies
0  109205     6354      None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z         None       None  message text1  [{'id': 109298, 'user_id': 5457, 'parent_id': ...
>>> pd.json_normalize(d, record_path=['view', 'replies'])
       id  user_id  parent_id            created_at            updated_at rating_count rating_sum        message
0  109298     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text2
1  109299     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text3

(I added a second reply to your example, with the same data and the id incremented by 1, so we can see what happens with multiple replies per view.)
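To try the two calls without the API, here is a minimal sketch. The dictionary d is a hypothetical, trimmed-down stand-in for the response in the question (only some of the fields are kept), so treat it as an illustration, not as the real payload.

```python
import pandas as pd

# Hypothetical sample shaped like the question's payload (fields trimmed).
d = {
    "view": [
        {
            "id": 109205,
            "user_id": 6354,
            "parent_id": None,
            "message": "message text1",
            "replies": [
                {"id": 109298, "user_id": 5457, "parent_id": 109205,
                 "message": "message text2"},
                {"id": 109299, "user_id": 5457, "parent_id": 109205,
                 "message": "message text3"},
            ],
        }
    ]
}

# Depth 1: one row per view; the nested replies stay as a list in one column.
views = pd.json_normalize(d, record_path="view")

# Depth 2: one row per reply, reached by walking view -> replies.
replies = pd.json_normalize(d, record_path=["view", "replies"])

print(views[["id", "user_id", "message"]])
print(replies[["id", "user_id", "parent_id", "message"]])
```

Each reply row keeps its parent_id, which is what makes it possible to match replies back to their view later.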

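Alternatively, you can run a second pd.json_normalize on the replies column of the previous result, which is probably less work. This is more useful if you first .explode() the column to get one row per reply: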

>>> view = pd.json_normalize(d, record_path='view')
>>> pd.json_normalize(view['replies'].explode())
       id  user_id  parent_id            created_at            updated_at rating_count rating_sum        message
0  109298     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text2
1  109299     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text3
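One caveat with the explode approach: a view whose replies list is empty, or that has no replies key at all, ends up as NaN after .explode(). The sketch below drops those entries before normalizing. The sample data and the extra views in it are hypothetical, and passing a plain list via .tolist() is a choice made here to keep the result index simple.

```python
import pandas as pd

# Hypothetical sample: one view with replies, one with an empty list,
# and one with no "replies" key at all.
d = {
    "view": [
        {
            "id": 109205, "user_id": 6354, "message": "message text1",
            "replies": [
                {"id": 109298, "user_id": 5457, "parent_id": 109205,
                 "message": "message text2"},
                {"id": 109299, "user_id": 5457, "parent_id": 109205,
                 "message": "message text3"},
            ],
        },
        {"id": 109300, "user_id": 7001, "message": "message text4",
         "replies": []},
        {"id": 109301, "user_id": 7002, "message": "message text5"},
    ]
}

view = pd.json_normalize(d, record_path="view")

# explode() gives one row per reply; empty lists and missing values become NaN.
exploded = view["replies"].explode()

# Drop the NaN placeholders so only real reply dicts are normalized.
replies = pd.json_normalize(exploded.dropna().tolist())

print(replies[["id", "user_id", "parent_id", "message"]])
```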

So here is one way to build a single dataframe with all the information:

>>> view = pd.json_normalize(d, record_path='view')
>>> df = pd.merge(
...     view.drop(columns=['replies']),
...     pd.json_normalize(view['replies'].explode()),
...     left_on='id', right_on='parent_id', how='right',
...     suffixes=('_view', '_reply')
... )
>>> df
   id_view  user_id_view parent_id_view       created_at_view       updated_at_view rating_count_view rating_sum_view   message_view  id_reply  user_id_reply  parent_id_reply      created_at_reply      updated_at_reply rating_count_reply rating_sum_reply  message_reply
0   109205          6354           None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z              None            None  message text1    109298           5457           109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z               None             None  message text2
1   109205          6354           None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z              None            None  message text1    109299           5457           109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z               None             None  message text3
>>> df[['user_id_view', 'message_view', 'user_id_reply', 'message_reply']]
   user_id_view   message_view  user_id_reply  message_reply
0          6354  message text1           5457  message text2
1          6354  message text1           5457  message text3
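As a possible alternative to merging, json_normalize also accepts meta, meta_prefix and record_prefix arguments, which copy selected parent fields onto each nested record. The sketch below is an illustration on hypothetical sample data, not part of the approach above. It assumes that some views may lack a replies key, which is one situation in which record_path raises a KeyError like the one in the question, so it fills in an empty list first.

```python
import pandas as pd

# Hypothetical sample: the second view has no "replies" key.
d = {
    "view": [
        {
            "id": 109205, "user_id": 6354, "message": "message text1",
            "replies": [
                {"id": 109298, "user_id": 5457, "parent_id": 109205,
                 "message": "message text2"},
                {"id": 109299, "user_id": 5457, "parent_id": 109205,
                 "message": "message text3"},
            ],
        },
        {"id": 109300, "user_id": 7001, "message": "message text4"},
    ]
}

# Make sure every view carries a "replies" list (missing or null -> []).
records = [{**v, "replies": v.get("replies") or []} for v in d["view"]]

# One row per reply; the prefixes keep view and reply column names distinct.
df = pd.json_normalize(
    records,
    record_path="replies",
    meta=["user_id", "message"],
    record_prefix="reply_",
    meta_prefix="view_",
)

print(df[["view_user_id", "view_message", "reply_user_id", "reply_message"]])
```

Views without replies contribute no rows here. If they should appear anyway, the merge shown above with how='left' is the better fit.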

Solution 1
3, accepted, 2021-08-20 15:42:58