I have a JSON file with the following structure (it's not the complete json file, but the structure is the same):
{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}
.....
//The rest of json continues with the same structure, but referenced_tweets is not always present
My question: How can I load this data into a dataframe with these columns: type
, id(referenced_tweet id)
, text
, created_at
, author_id
, and id (tweet id)
?
What I could do so far: I could get the following columns:
referenced_tweets | text | cerated_at | author_id | id (tweet id) |
---|---|---|---|---|
[{'type': 'xx', 'id': 'xxx'}] | xxx | xxxx | xxxxx | xxxxxxxxxxxx |
Here is the code to get the above table:
with open('Test_SampleRetweets.json') as json_file:
data_list = json.load(json_file)
df1 = json_normalize(data_list, 'data')
df1.head()
However, I'd like to get the type
and id
(in referenced_tweets) in separate columns and I could get the following so far:
type | id (referenced_tweet id) |
---|---|
xxxx | xxxxxxxxxxxxxxxxxxxxxxx |
and here is the code to get the above table:
df2 = json_normalize(data_list, record_path=['data','referenced_tweets'], errors='ignore')
df2.head()
What is the problem? I'd like to get everything in one table, ie, a table similar to the first one here but with type
and id
in separate columns (like the 2nd table). So, the final columns should be: type
, id (referenced_tweet id)
, text
, created_at
, author_id
, and id (tweet id)
.
import pandas as pd
with open('Test_SampleRetweets.json') as json_file:
raw_data = json.load(json_file)
data = []
for item in raw_data["data"]:
item["tweet_id"] = item["id"]
item.update(item["referenced_tweets"][0])
del item["referenced_tweets"]
data.append(item)
df1 = pd.DataFrame(data)
print(df1.head())
when working with a nested json in json_normalize()
, you need to work with the meta
parameter to get the fields in the meta level. So, essentially what you are doing is taking the nested and normalizing it and than left joining several other field from a level above. Apparently, you can combine this for several nested fields, see this for reference.
import json
import pandas as pd
j = '{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}'
j = json.loads(j)
# since you have id twice, it's a bit more complicated and you need to
# introduce a meta prefix
df = pd.json_normalize(
j,
record_path=["data", 'referenced_tweets'],
meta_prefix="data.",
meta=[["data", "text"], ["data", "created_at"], ["data", "author_id"], ["data", "id"]]
)
print(df)
resulting in:
type id data.data.text data.data.created_at \
0 retweeted xxxxxxx abcdefghijkl 2020-03-09T00:11:41.000Z
1 retweeted xxxxxxxxxxxx abcdefghijkl 2020-03-09T00:11:41.000Z
data.data.author_id data.data.id
0 xxxxx xxxxxxxxxxx
1 xxxxxxxx xxxxxxxxxx
I would prefere it this way, since it seems simpler to handle
df = pd.json_normalize(
j["data"],
record_path=['referenced_tweets'],
meta_prefix="data.",
meta=["text", "created_at", "author_id", "id"]
)
print(df)
resulting in:
type id data.text data.created_at \
0 retweeted xxxxxxx abcdefghijkl 2020-03-09T00:11:41.000Z
1 retweeted xxxxxxxxxxxx abcdefghijkl 2020-03-09T00:11:41.000Z
data.author_id data.id
0 xxxxx xxxxxxxxxxx
1 xxxxxxxx xxxxxxxxxxx
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.