Reading a nested JSON file into a Pandas dataframe

Question

I have a JSON file with the following structure (it's not the complete JSON file, but the structure is the same):

{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}
..... 

//The rest of json continues with the same structure, but referenced_tweets is not always present

My question: How can I load this data into a dataframe with these columns: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id)?

What I could do so far: I could get the following columns:

referenced_tweets	text	created_at	author_id	id (tweet id)
[{'type': 'xx', 'id': 'xxx'}]	xxx	xxxx	xxxxx	xxxxxxxxxxxx

Here is the code to get the above table:

import json
from pandas import json_normalize

with open('Test_SampleRetweets.json') as json_file:
    data_list = json.load(json_file)

df1 = json_normalize(data_list, 'data')
df1.head()

However, I'd like to get the type and id (in referenced_tweets) in separate columns, and I could get the following so far:

type	id (referenced_tweet id)
xxxx	xxxxxxxxxxxxxxxxxxxxxxx

And here is the code to get the above table:

df2 = json_normalize(data_list, record_path=['data','referenced_tweets'], errors='ignore')
df2.head()

What is the problem? I'd like to get everything in one table, i.e., a table similar to the first one here but with type and id in separate columns (like the second table). So, the final columns should be: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id).

Answer 1

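import json

import pandas as pd

with open('Test_SampleRetweets.json') as json_file:
    raw_data = json.load(json_file)

data = []
for item in raw_data["data"]:
    # Rename the tweet's own id so the referenced tweet's "id" does not overwrite it.
    item["tweet_id"] = item.pop("id")
    # referenced_tweets is not always present, so default to an empty list.
    referenced = item.pop("referenced_tweets", [])
    if referenced:
        # Lift "type" and "id" of the first referenced tweet to the top level.
        item.update(referenced[0])
    data.append(item)

df1 = pd.DataFrame(data)
print(df1.head())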

Answer 2

When working with nested JSON in json_normalize(), you need the meta parameter to pull in fields from the parent level. Essentially, you normalize the nested records and then left-join several other fields from the level above onto them. Apparently, you can combine this for several nested fields; see this for reference.

import json
import pandas as pd

j = '{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}'
j = json.loads(j)

# since you have id twice, it's a bit more complicated and you need to 
# introduce a meta prefix
df = pd.json_normalize(
    j,
    record_path=["data", 'referenced_tweets'],
    meta_prefix="data.",
    meta=[["data", "text"], ["data", "created_at"], ["data", "author_id"], ["data", "id"]]
    )
print(df)

resulting in:

        type            id data.data.text      data.data.created_at  \
0  retweeted       xxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z   
1  retweeted  xxxxxxxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z   

  data.data.author_id data.data.id  
0               xxxxx  xxxxxxxxxxx  
1            xxxxxxxx  xxxxxxxxxxx

I would prefer it this way, since it seems simpler to handle:

df = pd.json_normalize(
    j["data"],
    record_path=['referenced_tweets'],
    meta_prefix="data.",
    meta=["text", "created_at", "author_id", "id"]
    )
print(df)

resulting in:

        type            id     data.text           data.created_at  \
0  retweeted       xxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z   
1  retweeted  xxxxxxxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z   

  data.author_id      data.id  
0          xxxxx  xxxxxxxxxxx  
1       xxxxxxxx  xxxxxxxxxxx
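
The question notes that referenced_tweets is not always present, and as far as I know json_normalize raises a KeyError when the record_path key is missing from a record (errors='ignore' only covers meta keys). Below is a minimal sketch of one possible workaround, not a definitive solution: give every tweet a placeholder list before normalizing. The sample data and the "referenced_" prefix are made up for illustration; record_prefix is used here instead of meta_prefix so the tweet-level columns keep their original names.

```python
import pandas as pd

# Hypothetical sample: the second tweet has no "referenced_tweets" key.
raw = {
    "data": [
        {
            "referenced_tweets": [{"type": "retweeted", "id": "111"}],
            "text": "first",
            "created_at": "2020-03-09T00:11:41.000Z",
            "author_id": "a1",
            "id": "t1",
        },
        {
            "text": "second",
            "created_at": "2020-03-09T00:12:00.000Z",
            "author_id": "a2",
            "id": "t2",
        },
    ]
}

# Ensure every tweet has a non-empty referenced_tweets list.
# The placeholder [{}] produces one row whose type/id are missing (NaN).
tweets = [
    {**tweet, "referenced_tweets": tweet.get("referenced_tweets") or [{}]}
    for tweet in raw["data"]
]

df = pd.json_normalize(
    tweets,
    record_path=["referenced_tweets"],
    # Prefix the nested columns so the nested "id" does not clash with the tweet "id".
    record_prefix="referenced_",
    meta=["text", "created_at", "author_id", "id"],
)
print(df)
```

With this approach a tweet that references several tweets yields one row per reference, and a tweet without references still appears once, with missing values in the referenced_type and referenced_id columns.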
