Reading nested JSON file in Pandas dataframe
I have a JSON file with the following structure (this is not the complete JSON file, but the structure is the same):
{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}
.....
//The rest of json continues with the same structure, but referenced_tweets is not always present
My question: how can I load this data into a dataframe with the following columns: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id)?
What I have managed so far: I can get the following columns:
referenced_tweets | text | created_at | author_id | id (tweet id) |
---|---|---|---|---|
[{'type':'xx','id':'xxx'}] | xxx | xxxx | xxxxx | xxxxxxxxxxxx |
Here is the code that produces the table above:
import json
from pandas import json_normalize
with open('Test_SampleRetweets.json') as json_file:
    data_list = json.load(json_file)
df1 = json_normalize(data_list, 'data')
df1.head()
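However, I would like to get type and id (inside referenced_tweets) in separate columns, and so far I can get the following:
type | id (referenced_tweet id) |
---|---|
xxxx | xxxxxxxxxxxxxxxxxxxxxxxx |
Here is the code that produces the table above: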
df2 = json_normalize(data_list, record_path=['data','referenced_tweets'], errors='ignore')
df2.head()
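What is the problem? I would like to get everything in one table, i.e. a table similar to the first one here, but with type and id in separate columns (as in the second table). So the final columns should be: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id).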
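import json
import pandas as pd

with open('Test_SampleRetweets.json') as json_file:
    raw_data = json.load(json_file)

data = []
for item in raw_data["data"]:
    # move the tweet's own id aside so the referenced tweet's id can use "id"
    item["tweet_id"] = item.pop("id")
    # referenced_tweets is not always present, so do not index it blindly
    refs = item.pop("referenced_tweets", None)
    if refs:
        item.update(refs[0])
    data.append(item)

df1 = pd.DataFrame(data)
print(df1.head())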
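When working with nested JSON in json_normalize(), you need to use the meta parameter to get the fields at the meta level. Essentially, you normalize the nested records and then join a few other fields from the level above. You can obviously combine this across multiple nested fields; see this for reference.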
import json
import pandas as pd
j = '{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}'
j = json.loads(j)
# since you have id twice, it's a bit more complicated and you need to
# introduce a meta prefix
df = pd.json_normalize(
    j,
    record_path=["data", "referenced_tweets"],
    meta_prefix="data.",
    meta=[["data", "text"], ["data", "created_at"], ["data", "author_id"], ["data", "id"]]
)
print(df)
which results in:
        type            id data.data.text      data.data.created_at  \
0  retweeted       xxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z
1  retweeted  xxxxxxxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z
  data.data.author_id data.data.id
0               xxxxx  xxxxxxxxxxx
1            xxxxxxxx  xxxxxxxxxxx
I prefer this approach, since it looks easier to work with:
df = pd.json_normalize(
    j["data"],
    record_path=["referenced_tweets"],
    meta_prefix="data.",
    meta=["text", "created_at", "author_id", "id"]
)
print(df)
which results in:
        type            id     data.text           data.created_at  \
0  retweeted       xxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z
1  retweeted  xxxxxxxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z
  data.author_id      data.id
0          xxxxx  xxxxxxxxxxx
1       xxxxxxxx  xxxxxxxxxxx
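The question mentions that referenced_tweets is not always present, and json_normalize() raises a KeyError when record_path is missing from an element. One possible workaround, sketched below, is to give such tweets a single empty record before normalizing, then rename the columns to the ones asked for. The sample data, the "tweet." prefix and the names referenced_tweet_id / tweet_id are made up for illustration and are not part of the original question or answers.

```python
import pandas as pd

# Hypothetical sample: the second tweet has no "referenced_tweets" key.
sample = {
    "data": [
        {
            "referenced_tweets": [{"type": "retweeted", "id": "r1"}],
            "text": "first",
            "created_at": "2020-03-09T00:11:41.000Z",
            "author_id": "a1",
            "id": "t1",
        },
        {
            "text": "second",
            "created_at": "2020-03-09T00:12:00.000Z",
            "author_id": "a2",
            "id": "t2",
        },
    ]
}

# Give tweets without references one empty record, so record_path always
# resolves and each such tweet still produces a row (with NaN type/id).
tweets = [
    {**tweet, "referenced_tweets": tweet.get("referenced_tweets") or [{}]}
    for tweet in sample["data"]
]

df = pd.json_normalize(
    tweets,
    record_path="referenced_tweets",
    meta=["text", "created_at", "author_id", "id"],
    meta_prefix="tweet.",  # avoids the clash between the two "id" fields
)

# Rename to column names close to the ones requested in the question.
df = df.rename(
    columns={
        "id": "referenced_tweet_id",
        "tweet.text": "text",
        "tweet.created_at": "created_at",
        "tweet.author_id": "author_id",
        "tweet.id": "tweet_id",
    }
)
print(df)
```

Note that a tweet referencing several tweets yields one row per reference with this approach, whereas taking only the first element of referenced_tweets keeps one row per tweet.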