How to extract two values from nested example data into a pandas DataFrame?

Question

I am working with a dataset from Stanford (see Dev Set 2.0). The file is in JSON format. When I read the file it is a dictionary, but I converted it to a DataFrame:

import json
import pandas as pd

# Read the JSON file into a plain Python dictionary.
with open("dev-v2.0.json", "r") as json_file:
    json_data = json.load(json_file)

df = pd.DataFrame.from_dict(json_data)
df = df[0:2]  # for this example, only a subset

All the information I need is in the df['data'] column. Each row holds a lot of data in the following format:

{'title': 'Normans', 'paragraphs': [{'qas': [{'question': 'In what country is Normandy located?', 'id': '56ddde6b9a695914005b9628', 'answers': [{'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}, {'text': 'France', 'answer_start': 159}], 'is_impossible': False}, {'question': 'When were the Normans in Normandy?', 'id': '56ddde6b9a695914005b9629', 'answers': [{'text': '10th and 11th centuries', 'answer_start': 94}, {'text': 'in the 10th and 11th centuries', 'answer_start': 87}

I want to extract all questions and answers from every row of the DataFrame. Ideally, the output would look like this:

Question                                         Answer 
'In what country is Normandy located?'          'France'
'When were the Normans in Normandy?'            'in the 10th and 11th centuries'

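Apologies in advance! I have read the "good examples" post, but I find it hard to produce reproducible data for this example, because it appears to be a dictionary containing a list, with a small dictionary inside the list, inside another dictionary, and then more dictionaries... When I use print(df["data"]), it only prints a small subset (which does not help reproduce the problem).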

print(df['data'])
0    {'title': 'Normans', 'paragraphs': [{'qas': [{...
1    {'title': 'Computational_complexity_theory', '...
Name: data, dtype: object
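
One way to make the nesting visible, and to produce a small reproducible sample, is to write out a single trimmed record by hand and pretty-print it with json.dumps. The sketch below is only an illustration: the `sample` dictionary is a shortened stand-in built from the fragment shown earlier, not the full record from the file.

```python
import json

# Hand-written, shortened stand-in for one element of the "data" list.
sample = {
    "title": "Normans",
    "paragraphs": [
        {
            "qas": [
                {
                    "question": "In what country is Normandy located?",
                    "id": "56ddde6b9a695914005b9628",
                    "answers": [{"text": "France", "answer_start": 159}],
                    "is_impossible": False,
                },
            ],
        },
    ],
}

# indent=2 prints every nesting level on its own line instead of the
# truncated one-line repr that pandas shows for object columns.
pretty = json.dumps(sample, indent=2)
print(pretty)
```

On the real data, something like print(json.dumps(df["data"][0], indent=2)[:1500]) should show the start of the first record in the same layout, since each cell of that column is an ordinary dictionary.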

Thanks in advance!

Answer 1

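This should get you started.

I am not sure how to handle the case where the answers field is empty, so you may want to come up with a better solution. Example: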

"question": " After 1945, what challenged the British empire?", "id": "5ad032b377cf76001a686e0d", "answers": [], "is_impossible": true

import json
import pandas as pd

# Load the raw file as a plain dictionary; JSON files are normally UTF-8.
with open("dev-v2.0.json", "r", encoding="utf-8") as f:
    data = json.load(f)

questions, answers = [], []

# Walk the nesting: articles -> paragraphs -> question/answer entries.
for article in data["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            questions.append(qa["question"])
            try:  # only takes the first element since the rest of the values are duplicated?
                answers.append(qa["answers"][0]["text"])
            except IndexError:  # when `"answers": []`
                answers.append("None")

d = {
    "Questions": questions,
    "Answers": answers
}

pd.DataFrame(d)

                                               Questions                      Answers
0                   In what country is Normandy located?                       France
1                     When were the Normans in Normandy?      10th and 11th centuries
2          From which countries did the Norse originate?  Denmark, Iceland and Norway
3                              Who was the Norse leader?                        Rollo
4      What century did the Normans first gain their ...                 10th century
...                                                  ...                          ...
11868  What is the seldom used force unit equal to on...                       sthène
11869           What does not have a metric counterpart?                         None
11870  What is the force exerted by standard gravity ...                         None
11871  What force leads to a commonly used unit of mass?                         None
11872        What force is part of the modern SI system?                         None

[11873 rows x 2 columns]
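
As for the open point about empty answer lists, one possible treatment (a sketch, not a definitive fix) is to store a real missing value instead of the string "None", carry the is_impossible flag along, and keep every distinct answer text rather than only the first. The `data` dictionary below is a hypothetical miniature of the file's layout, and `extract_rows` and the column names are made up for illustration.

```python
# Hypothetical miniature of the file's layout, so the sketch runs on its own.
data = {
    "data": [
        {
            "title": "Example_article",  # placeholder title
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "When were the Normans in Normandy?",
                            "answers": [
                                {"text": "10th and 11th centuries", "answer_start": 94},
                                {"text": "in the 10th and 11th centuries", "answer_start": 87},
                                {"text": "10th and 11th centuries", "answer_start": 94},
                            ],
                            "is_impossible": False,
                        },
                        {
                            "question": "After 1945, what challenged the British empire?",
                            "answers": [],
                            "is_impossible": True,
                        },
                    ]
                }
            ],
        }
    ]
}


def extract_rows(squad):
    """Return one dict per question, with its distinct answer texts."""
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # dict.fromkeys drops duplicates while keeping first-seen order.
                texts = list(dict.fromkeys(a["text"] for a in qa["answers"]))
                rows.append({
                    "Question": qa["question"],
                    "Answer": texts[0] if texts else None,  # None marks "no answer"
                    "AllAnswers": texts,
                    "Impossible": qa.get("is_impossible", False),
                })
    return rows


rows = extract_rows(data)
```

Passing the result to pd.DataFrame(rows) gives one row per question; because unanswered questions hold None rather than a string, they can be filtered with isna() or by the Impossible column.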

Answer 2

The following page (SQuAD (Stanford Q&A) json to Pandas DataFrame) deals with converting dev-v1.1.json to DataFrame.
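
Independently of that page, a pandas-native route that may be worth trying is pd.json_normalize (available as a top-level function since pandas 1.0), which can walk the nested lists without explicit loops. The sketch below assumes the layout shown in the question; the `data` dictionary is a hypothetical miniature, and the second article's title is a placeholder.

```python
import pandas as pd

# Hypothetical miniature of the file's layout (not the real data).
data = {
    "data": [
        {
            "title": "Normans",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "In what country is Normandy located?",
                            "answers": [{"text": "France", "answer_start": 159}],
                            "is_impossible": False,
                        }
                    ]
                }
            ],
        },
        {
            "title": "Example_article",  # placeholder title
            "paragraphs": [
                {
                    "qas": [
                        {
                            "question": "After 1945, what challenged the British empire?",
                            "answers": [],
                            "is_impossible": True,
                        }
                    ]
                }
            ],
        },
    ]
}

# record_path walks each article -> "paragraphs" -> "qas"; meta copies the
# article title onto every resulting row.
qa_df = pd.json_normalize(
    data["data"], record_path=["paragraphs", "qas"], meta=["title"]
)

# Each cell of "answers" is still a list of dicts; keep the first text, if any.
qa_df["answer"] = qa_df["answers"].apply(
    lambda items: items[0]["text"] if items else None
)

result = qa_df[["title", "question", "answer"]]
print(result)
```

With the real file, the same call on the loaded dictionary's "data" list should yield one row per question. The loop-based approach is easier to debug, while this version keeps the other per-question fields (such as id and is_impossible) as columns for free.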

2 solutions

Solution 1
1 2019-10-07 12:23:16

Solution 2
1 accepted 2019-10-07 12:28:28