Reading nested JSON file in Pandas dataframe
I have a JSON file with the following structure (it's not the complete JSON file, but the structure is the same):
{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}
.....
//The rest of json continues with the same structure, but referenced_tweets is not always present
My question: How can I load this data into a dataframe with these columns: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id)?
What I could do so far: I could get the following columns:
referenced_tweets | text | created_at | author_id | id (tweet id) |
---|---|---|---|---|
[{'type': 'xx', 'id': 'xxx'}] | xxx | xxxx | xxxxx | xxxxxxxxxxxx |
Here is the code to get the above table:
import json
from pandas import json_normalize

with open('Test_SampleRetweets.json') as json_file:
    data_list = json.load(json_file)

df1 = json_normalize(data_list, 'data')
df1.head()
However, I'd like to get the type and id (in referenced_tweets) in separate columns, and I could get the following so far:
type | id (referenced_tweet id) |
---|---|
xxxx | xxxxxxxxxxxxxxxxxxxxxxx |
And here is the code to get the above table:
df2 = json_normalize(data_list, record_path=['data','referenced_tweets'], errors='ignore')
df2.head()
What is the problem? I'd like to get everything in one table, i.e., a table similar to the first one here but with type and id in separate columns (like the 2nd table). So, the final columns should be: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id).
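import json
import pandas as pd

with open('Test_SampleRetweets.json') as json_file:
    raw_data = json.load(json_file)

data = []
for item in raw_data["data"]:
    # Move the tweet's own id to "tweet_id" so the referenced tweet's
    # "id" can be added below without overwriting it.
    item["tweet_id"] = item.pop("id")
    # referenced_tweets is not always present, so default to None.
    referenced = item.pop("referenced_tweets", None)
    if referenced:
        # Adds "type" and "id" of the first referenced tweet.
        item.update(referenced[0])
    data.append(item)

df1 = pd.DataFrame(data)
print(df1.head())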
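The loop above keeps only the first entry of referenced_tweets. If a tweet can reference several tweets and each should get its own row, one possible alternative (not part of the original answer) is DataFrame.explode. The following is only a sketch under assumptions: the sample data and the column names tweet_id and referenced_tweet_id are made up for illustration, and ignore_index requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical sample data: the first tweet references two tweets,
# the second has no "referenced_tweets" key at all.
tweets = [
    {
        "referenced_tweets": [
            {"type": "retweeted", "id": "r1"},
            {"type": "quoted", "id": "r2"},
        ],
        "text": "abc",
        "created_at": "2020-03-09T00:11:41.000Z",
        "author_id": "a1",
        "id": "t1",
    },
    {
        "text": "def",
        "created_at": "2020-03-09T00:12:00.000Z",
        "author_id": "a2",
        "id": "t2",
    },
]

# One row per tweet; a missing "referenced_tweets" key becomes NaN.
df = pd.DataFrame(tweets).rename(columns={"id": "tweet_id"})

# One row per referenced tweet; NaN cells are kept as a single row.
df = df.explode("referenced_tweets", ignore_index=True)

# Expand each dict into "type" and "id"; non-dict cells (NaN) become empty rows.
refs = pd.DataFrame(
    [r if isinstance(r, dict) else {} for r in df["referenced_tweets"]]
).rename(columns={"id": "referenced_tweet_id"})

# Both frames share the same 0..n-1 index, so a column-wise concat lines up.
result = pd.concat([refs, df.drop(columns="referenced_tweets")], axis=1)
print(result)
```

Tweets without referenced_tweets stay in the result with missing values in type and referenced_tweet_id, which may or may not be what you want; filter them out with dropna on those columns if not.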
When working with nested JSON in json_normalize(), you need to use the meta parameter to get the fields in the meta (parent) level. So, essentially what you are doing is taking the nested records, normalizing them, and then left joining several other fields from the level above. Apparently, you can combine this for several nested fields, see this for reference.
import json
import pandas as pd

j = '{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}'
j = json.loads(j)

# since you have id twice, it's a bit more complicated and you need to
# introduce a meta prefix
df = pd.json_normalize(
    j,
    record_path=["data", "referenced_tweets"],
    meta_prefix="data.",
    meta=[["data", "text"], ["data", "created_at"], ["data", "author_id"], ["data", "id"]],
)
print(df)
resulting in:
        type            id  data.data.text      data.data.created_at  \
0  retweeted       xxxxxxx    abcdefghijkl  2020-03-09T00:11:41.000Z
1  retweeted  xxxxxxxxxxxx    abcdefghijkl  2020-03-09T00:11:41.000Z

   data.data.author_id  data.data.id
0                xxxxx   xxxxxxxxxxx
1             xxxxxxxx   xxxxxxxxxxx
I would prefer it this way, since it seems simpler to handle:
df = pd.json_normalize(
    j["data"],
    record_path=["referenced_tweets"],
    meta_prefix="data.",
    meta=["text", "created_at", "author_id", "id"],
)
print(df)
resulting in:
        type            id     data.text           data.created_at  \
0  retweeted       xxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z
1  retweeted  xxxxxxxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z

   data.author_id      data.id
0           xxxxx  xxxxxxxxxxx
1        xxxxxxxx  xxxxxxxxxxx
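Note that the question says referenced_tweets is not always present, while json_normalize() with a record_path expects every record to contain that path and, as far as I can tell, raises a KeyError otherwise (errors='ignore' only covers the meta fields). One way around this, sketched here under assumptions, is to give every tweet a placeholder list before normalizing and then rename the columns to what the question asks for. The sample data, the "tweet." prefix, and the names referenced_tweet_id and tweet_id are illustrative choices, not something from the original post.

```python
import pandas as pd

# Hypothetical sample: the second tweet has no "referenced_tweets" key.
raw = {
    "data": [
        {
            "referenced_tweets": [{"type": "retweeted", "id": "111"}],
            "text": "abc",
            "created_at": "2020-03-09T00:11:41.000Z",
            "author_id": "a1",
            "id": "t1",
        },
        {
            "text": "def",
            "created_at": "2020-03-09T00:12:00.000Z",
            "author_id": "a2",
            "id": "t2",
        },
    ]
}

# Give every tweet a referenced_tweets list. A single empty dict yields one
# row whose "type" and "id" are missing instead of failing on the absent key.
tweets = [
    {**tweet, "referenced_tweets": tweet.get("referenced_tweets") or [{}]}
    for tweet in raw["data"]
]

df = pd.json_normalize(
    tweets,
    record_path="referenced_tweets",
    meta=["text", "created_at", "author_id", "id"],
    meta_prefix="tweet.",  # needed because "id" exists on both levels
)

# Rename to the column names asked for in the question.
df = df.rename(
    columns={
        "id": "referenced_tweet_id",
        "tweet.text": "text",
        "tweet.created_at": "created_at",
        "tweet.author_id": "author_id",
        "tweet.id": "tweet_id",
    }
)
print(df)
```

The copy via {**tweet, ...} leaves the loaded data untouched; if that does not matter, setting the key in place with setdefault would work as well.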