How to flatten dictionaries in a DataFrame and join all the resulting rows

Question

I am using GitHub's GraphQL API to fetch details about some issues.

I use Python's requests library to fetch the data locally. This is what output.json looks like:

{
    "data": {
        "viewer": {
            "login": "some_user"
        },
        "repository": {
            "issues": {
                "edges": [
                    {
                        "node": {
                            "id": "I_kwDOHQ63-s5auKbD",
                            "title": "test issue 1",
                            "number": 146,
                            "createdAt": "2023-01-06T06:39:54Z",
                            "closedAt": null,
                            "state": "OPEN",
                            "updatedAt": "2023-01-06T06:42:00Z",
                            "comments": {
                                "edges": [
                                    {
                                        "node": {
                                            "id": "IC_kwDOHQ63-s5R2XCV",
                                            "body": "comment 01"
                                        }
                                    },
                                    {
                                        "node": {
                                            "id": "IC_kwDOHQ63-s5R2XC9",
                                            "body": "comment 02"
                                        }
                                    }
                                ]
                            },
                            "labels": {
                                "edges": []
                            }
                        },
                        "cursor": "Y3Vyc29yOnYyOpHOWrimww=="
                    },
                    {
                        "node": {
                            "id": "I_kwDOHQ63-s5auKm8",
                            "title": "test issue 2",
                            "number": 147,
                            "createdAt": "2023-01-06T06:40:34Z",
                            "closedAt": null,
                            "state": "OPEN",
                            "updatedAt": "2023-01-06T06:40:34Z",
                            "comments": {
                                "edges": []
                            },
                            "labels": {
                                "edges": [
                                    {
                                        "node": {
                                            "name": "food"
                                        }
                                    },
                                    {
                                        "node": {
                                            "name": "healthy"
                                        }
                                    }
                                ]
                            }
                        },
                        "cursor": "Y3Vyc29yOnYyOpHOWripvA=="
                    }
                ]
            }
        }
    }
}

The JSON is put into a list:

result = response.json()["data"]["repository"]["issues"]["edges"]

This list is then put into a DataFrame:

import pandas as pd
df = pd.DataFrame(result, columns=['node', 'cursor'])
df

These are the contents of the DataFrame:

id	title	number	createdAt	closedAt	state	updatedAt	comments	labels
I_kwDOHQ63-s5auKbD	test issue 1	146	2023-01-06T06:39:54Z	None	OPEN	2023-01-06T06:42:00Z	{'edges': [{'node': {'id': 'IC_kwDOHQ63-s5R2XCV', 'body': 'comment 01'}}, {'node': {'id': 'IC_kwDOHQ63-s5R2XC9', 'body': 'comment 02'}}]}	{'edges': []}
I_kwDOHQ63-s5auKm8	test issue 2	147	2023-01-06T06:40:34Z	None	OPEN	2023-01-06T06:40:34Z	{'edges': []}	{'edges': [{'node': {'name': 'food'}}, {'node': {'name': 'healthy'}}]}

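I want to split/explode the comments and labels columns. The values in these columns are nested dictionaries.

I want a single issue to have as many rows as it has comments and labels. I want to flatten the DataFrame, so this involves splitting/exploding and then joining.

Several Stack Overflow answers cover this topic in depth, and I have tried the code from a few of them. I can't paste links to those questions because Stack Overflow flags my question as spam when it contains that many links. These are the steps I tried: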

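# df2 is not defined in the steps shown above; presumably it is the expanded
# node column, e.g. df2 = df['node'].apply(pd.Series)
df3 = df2['comments'].apply(pd.Series)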

Drilling down further:

df4 = df3['edges'].apply(pd.Series)
df4

Drilling down further:

df5 = df4['node'].apply(pd.Series)
df5

The last statement above gives me KeyError: 'node'. I understand that this is because node is not a key (column) in that DataFrame.
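To see where the KeyError comes from, here is a small reproduction on data shaped like output.json. The sample is trimmed, and `df2` is assumed to be the expanded `node` column, which the steps above do not show. Calling `apply(pd.Series)` on a column of lists produces positional column labels (0, 1, ...), so there is no `node` column at that level.

```python
import pandas as pd

# Trimmed stand-in for response.json()["data"]["repository"]["issues"]["edges"]
result = [
    {"node": {"number": 146,
              "comments": {"edges": [{"node": {"id": "a", "body": "comment 01"}},
                                     {"node": {"id": "b", "body": "comment 02"}}]}},
     "cursor": "c1"},
    {"node": {"number": 147, "comments": {"edges": []}}, "cursor": "c2"},
]

df = pd.DataFrame(result, columns=["node", "cursor"])
df2 = df["node"].apply(pd.Series)       # assumed step: one column per key of "node"
df3 = df2["comments"].apply(pd.Series)  # a single column, "edges", holding lists
df4 = df3["edges"].apply(pd.Series)     # lists expand into positional columns

print(list(df3.columns))
print(list(df4.columns))
print("node" in df4.columns)
```

Each cell of `df4` holds one `{'node': {...}}` dict (or NaN), so `node` is a key inside the cell values, not a column label.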

But how else can I split this dictionary and join all the columns back onto my issues rows? This is what I would like the output to look like:

id	title	number	createdAt	closedAt	state	updatedAt	comments	labels
I_kwDOHQ63-s5auKbD	test issue 1	146	2023-01-06T06:39:54Z	None	OPEN	2023-01-06T06:42:00Z	comment 01	Null
I_kwDOHQ63-s5auKbD	test issue 1	146	2023-01-06T06:39:54Z	None	OPEN	2023-01-06T06:42:00Z	comment 02	Null
I_kwDOHQ63-s5auKm8	test issue 2	147	2023-01-06T06:40:34Z	None	OPEN	2023-01-06T06:40:34Z	Null	food
I_kwDOHQ63-s5auKm8	test issue 2	147	2023-01-06T06:40:34Z	None	OPEN	2023-01-06T06:40:34Z	Null	healthy

Answer 1

If dct is the dictionary from your question, you can try:

df = pd.DataFrame(d['node'] for d in dct['data']['repository']['issues']['edges'])
df['comments'] = df['comments'].str['edges']
df = df.explode('comments')
df['comments'] = df['comments'].str['node'].str['body']

df['labels'] = df['labels'].str['edges']
df = df.explode('labels')
df['labels'] = df['labels'].str['node'].str['name']

print(df.to_markdown(index=False))

Prints:

id	title	number	createdAt	state	updatedAt	comments	labels
I_kwDOHQ63-s5auKbD	test issue 1	146	2023-01-06T06:39:54Z	OPEN	2023-01-06T06:42:00Z	comment 01	nan
I_kwDOHQ63-s5auKbD	test issue 1	146	2023-01-06T06:39:54Z	OPEN	2023-01-06T06:42:00Z	comment 02	nan
I_kwDOHQ63-s5auKm8	test issue 2	147	2023-01-06T06:40:34Z	OPEN	2023-01-06T06:40:34Z	nan	food
I_kwDOHQ63-s5auKm8	test issue 2	147	2023-01-06T06:40:34Z	OPEN	2023-01-06T06:40:34Z	nan	healthy
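One thing to be aware of with two consecutive explode calls: an issue that has both comments and labels yields one row per (comment, label) pair, and the exploded rows keep the index of the row they came from. A minimal sketch on made-up data (the values are illustrative, not taken from the API):

```python
import pandas as pd

df = pd.DataFrame({
    "number": [1, 2],
    "comments": [["c1", "c2"], []],
    "labels": [["bug", "food"], ["healthy"]],
})

out = df.explode("comments").explode("labels")
# Issue 1 has 2 comments and 2 labels, so it contributes 2 * 2 = 4 rows.
# Issue 2 has no comments: explode turns the empty list into NaN and keeps the row.

# Optional: renumber the rows, since explode repeats the original index values.
out = out.reset_index(drop=True)
print(out)
```

If the pairing is not wanted, one option is to explode comments and labels into two separate frames and combine them afterwards; which shape is right depends on how the result will be used.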

Answer 2

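@andrej-kesely answered my question.
I have accepted his answer as the answer to this question.
I am now posting a combined script that includes my own clumsy code and andrej's excellent code.

In this script, I want to fetch details from GitHub's GraphQL API server
and load them into pandas.
The main source for this script is this gist.
Most of the remaining code is from @andrej-kesely's answer. Now on to the combined script.

First, import the necessary packages and set up the headers: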

import requests
import json
import pandas as pd

headers = {"Authorization": "token <your_github_personal_access_token>"}

Now define the query that will fetch the data from GitHub.
In my particular case I am fetching issue details from a specific repo; it could be something else for you.

query = """
{
  viewer {
    login
  }
  repository(name: "your_github_repo", owner: "your_github_user_name") {
    issues(states: OPEN, last: 2) {
      edges {
        node {
          id
          title
          number
          createdAt
          closedAt
          state
          updatedAt
          comments(first: 10) {
            edges {
              node {
                id
                body
              }
            }
          }
          labels(orderBy: {field: NAME, direction: ASC}, first: 10) {
            edges {
              node {
                name
              }
            }
          }
        }
        cursor
      }
    }
  }
}
"""

Run the query and save the response:

def run_query(query):
    request = requests.post('https://api.github.com/graphql', json={'query': query}, headers=headers)
    if request.status_code == 200:
        return request.json()
    else:
        raise Exception("Query failed to run by returning code of {}. {}".format(request.status_code, query))

result = run_query(query)
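A GraphQL endpoint can answer with HTTP 200 and still report a problem in an `errors` list in the JSON body, so checking only the status code may miss failures. Below is a hedged sketch of an extra check on the parsed payload. `check_graphql_payload` is a hypothetical helper name, not part of requests or the GitHub API, and the sample payloads are hand-written.

```python
def check_graphql_payload(payload):
    """Return payload["data"], raising if the response carries GraphQL errors."""
    errors = payload.get("errors")
    if errors:
        # Each error object normally has a "message" field; fall back to the object itself.
        messages = "; ".join(str(e.get("message", e)) for e in errors)
        raise RuntimeError(f"GraphQL query returned errors: {messages}")
    return payload["data"]


# Usage with hand-written payloads (no network access needed):
ok = {"data": {"viewer": {"login": "some_user"}}}
print(check_graphql_payload(ok))

bad = {"data": None, "errors": [{"message": "Field 'foo' doesn't exist on type 'Issue'"}]}
try:
    check_graphql_payload(bad)
except RuntimeError as exc:
    print(exc)
```

With the script above, this could be applied as `data = check_graphql_payload(run_query(query))` before building the DataFrame.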

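Now for the trickiest part.
My query response contains several nested dictionaries.
I want to split them apart; more details are in my question above.
This neat bit of code from @andrej-kesely does that for you.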

df = pd.DataFrame(d['node'] for d in result['data']['repository']['issues']['edges'])
df['comments'] = df['comments'].str['edges']
df = df.explode('comments')
df['comments'] = df['comments'].str['node'].str['body']

df['labels'] = df['labels'].str['edges']
df = df.explode('labels')
df['labels'] = df['labels'].str['node'].str['name']

print(df)
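Since the comments and labels steps follow the same pattern, they could be folded into a small helper. This is only a sketch: `flatten_connection` is a hypothetical name, and the `nodes` list is a trimmed stand-in for the `node` dicts in the query result.

```python
import pandas as pd


def flatten_connection(df, column, field):
    """One row per node of a {"edges": [{"node": {...}}, ...]} column, keeping only `field`."""
    out = df.copy()
    out[column] = out[column].str["edges"]            # dict -> list of edges
    out = out.explode(column)                         # one row per edge (NaN if the list is empty)
    out[column] = out[column].str["node"].str[field]  # edge -> node -> chosen field
    return out


# Trimmed stand-in for the "node" dicts returned by the query
nodes = [
    {"number": 146,
     "comments": {"edges": [{"node": {"body": "comment 01"}}, {"node": {"body": "comment 02"}}]},
     "labels": {"edges": []}},
    {"number": 147,
     "comments": {"edges": []},
     "labels": {"edges": [{"node": {"name": "food"}}, {"node": {"name": "healthy"}}]}},
]

df = pd.DataFrame(nodes)
for column, field in [("comments", "body"), ("labels", "name")]:
    df = flatten_connection(df, column, field)
print(df)
```

The helper makes it easier to add another connection field to the query later without repeating the three lines by hand.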
