
Reading nested JSON file in Pandas dataframe

I have a JSON file with the following structure (it's not the complete json file, but the structure is the same):

{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}
..... 

//The rest of json continues with the same structure, but referenced_tweets is not always present  

My question: How can I load this data into a dataframe with these columns: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id)?

What I could do so far: I could get the following columns:

referenced_tweets              text  created_at  author_id  id (tweet id)
[{'type': 'xx', 'id': 'xxx'}]  xxx   xxxx        xxxxx      xxxxxxxxxxxx

Here is the code to get the above table:

import json
from pandas import json_normalize

with open('Test_SampleRetweets.json') as json_file:
    data_list = json.load(json_file)

df1 = json_normalize(data_list, 'data')
df1.head()

However, I'd like to get the type and id (in referenced_tweets) in separate columns and I could get the following so far:

type  id (referenced_tweet id)
xxxx  xxxxxxxxxxxxxxxxxxxxxxx

and here is the code to get the above table:

df2 = json_normalize(data_list, record_path=['data','referenced_tweets'], errors='ignore')
df2.head()

What is the problem? I'd like to get everything in one table, i.e., a table similar to the first one here but with type and id in separate columns (like the second table). So, the final columns should be: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id).

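import json

import pandas as pd

with open('Test_SampleRetweets.json') as json_file:
    raw_data = json.load(json_file)

data = []
for item in raw_data["data"]:
    # Move the tweet's own id to "tweet_id" so that "id" is free for
    # the referenced tweet's id.
    item["tweet_id"] = item.pop("id")
    # referenced_tweets is not always present; only its first entry is used.
    referenced = item.pop("referenced_tweets", None)
    if referenced:
        item.update(referenced[0])
    data.append(item)

df1 = pd.DataFrame(data)
print(df1.head())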

When working with nested JSON in json_normalize(), you need the meta parameter to pull in fields from the parent level. Essentially, you normalize the nested records and then left join several other fields from the level above. Apparently you can combine this for several nested fields.

import json
import pandas as pd

j = '{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}'
j = json.loads(j)

# since you have id twice, it's a bit more complicated and you need to 
# introduce a meta prefix
df = pd.json_normalize(
    j,
    record_path=["data", 'referenced_tweets'],
    meta_prefix="data.",
    meta=[["data", "text"], ["data", "created_at"], ["data", "author_id"], ["data", "id"]]
    )
print(df)

resulting in:

        type            id data.data.text      data.data.created_at  \
0  retweeted       xxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z   
1  retweeted  xxxxxxxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z   

  data.data.author_id data.data.id  
0               xxxxx  xxxxxxxxxxx  
1            xxxxxxxx  xxxxxxxxxxx

I would prefer it this way, since it seems simpler to handle:

df = pd.json_normalize(
    j["data"],
    record_path=['referenced_tweets'],
    meta_prefix="data.",
    meta=["text", "created_at", "author_id", "id"]
    )
print(df)

resulting in:

        type            id     data.text           data.created_at  \
0  retweeted       xxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z   
1  retweeted  xxxxxxxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z   

  data.author_id      data.id  
0          xxxxx  xxxxxxxxxxx  
1       xxxxxxxx  xxxxxxxxxxx 
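
The question notes that referenced_tweets is not always present, and json_normalize() raises a KeyError when the record_path key is missing from a record. One possible workaround, shown here as a minimal sketch, is to pad such tweets with a placeholder entry before normalizing. The sample data, the flatten_tweets helper, and the EMPTY_REFERENCE placeholder are all hypothetical names made up for illustration; record_prefix is used instead of meta_prefix so that the two id fields do not clash.

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's file; the second
# tweet has no "referenced_tweets" key.
raw = {
    "data": [
        {
            "referenced_tweets": [{"type": "retweeted", "id": "111"}],
            "text": "abc",
            "created_at": "2020-03-09T00:11:41.000Z",
            "author_id": "a1",
            "id": "t1",
        },
        {
            "text": "def",
            "created_at": "2020-03-09T00:12:00.000Z",
            "author_id": "a2",
            "id": "t2",
        },
    ]
}

# Placeholder used when a tweet references nothing (an assumption of this sketch).
EMPTY_REFERENCE = [{"type": None, "id": None}]


def flatten_tweets(tweets):
    """Return one row per referenced tweet, keeping tweets without references."""
    # Copy each tweet and make sure the record_path key always exists.
    padded = [
        {**tweet, "referenced_tweets": tweet.get("referenced_tweets") or EMPTY_REFERENCE}
        for tweet in tweets
    ]
    return pd.json_normalize(
        padded,
        record_path=["referenced_tweets"],
        record_prefix="referenced_",  # avoids the clash between the two "id" fields
        meta=["text", "created_at", "author_id", "id"],
    )


df = flatten_tweets(raw["data"])
print(df)
```

With this sketch the nested fields come out as referenced_type and referenced_id, the tweet's own fields keep their original names, and a tweet without references yields a single row with missing values in the two referenced_ columns.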
