
How to create a pandas DataFrame by reading a list of dictionaries from a txt file?

I downloaded Twitter data with tweepy and stored each tweet in tweet_data.

tweet_data = []

for tweet_id in tweet_id_list:
    try:
        tweet_line = api.get_status(tweet_id,
                                    trim_user=True,
                                    include_my_retweet=False,
                                    include_entities=False,
                                    include_ext_alt_text=False,
                                    tweet_mode='extended')
        tweet_data.append(tweet_line)
    except Exception:
        continue  # if tweet_id is not found on Twitter, move on to the next tweet_id

Then I write tweet_data to 'twitter_json.txt', one JSON object per line.

with open('twitter_json.txt', 'w') as txt:
    for data in tweet_data:
        tweet = data._json
        tweet = json.dumps(tweet)
        try:
            txt.write(tweet + '\n')
        except Exception as e:
            print(e)

Here is part of the data from the text file.

{"created_at": "Tue Aug 01 16:23:56 +0000 2017", "id": sample_01, "id_str": sample_01, "full_text": "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 ", "truncated": false, "display_text_range": [0, 85], "extended_entities": {"media": [{"id": 892420639486877696, "id_str": "892420639486877696", "indices": [86, 109], "media_url": "some_url", "media_url_": "some_url", "url": some_url, "display_url": some_url, "expanded_url": some_url, "type": "photo", "sizes": {"thumb": {"w": 150, "h": 150, "resize": "crop"}, "medium": {"w": 540, "h": 528, "resize": "fit"}, "small": {"w": 540, "h": 528, "resize": "fit"}, "large": {"w": 540, "h": 528, "resize": "fit"}}}]}, "source": "<a some_url", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 4196983835, "id_str": "4196983835"}, "geo": null, "coordinates": null, "place": null, "contributors": null, "is_quote_status": false, "retweet_count": 7427, "favorite_count": 35179, "favorited": false, "retweeted": false, "possibly_sensitive": false, "possibly_sensitive_appealable": false, "lang": "en"}
{"created_at": "Tue Aug 01 00:17:27 +0000 2017", "id": sample_02, "id_str": sample_02, "full_text": "This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 some_url", "truncated": false, "display_text_range": [0, 138], "extended_entities": {"media": [{"id": 892177413194625024, "id_str": "892177413194625024", "indices": [139, 162], "media_url": "some_url", "media_url_": "some_url", "url": "some_url", "display_url": "some_url", "expanded_url": "some_url", "type": "photo", "sizes": {"thumb": {"w": 150, "h": 150, "resize": "crop"}, "medium": {"w": 1055, "h": 1200, "resize": "fit"}, "small": {"w": 598, "h": 680, "resize": "fit"}, "large": {"w": 1407, "h": 1600, "resize": "fit"}}}]}, "source": "some_url", "in_reply_to_status_id": null, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "in_reply_to_user_id_str": null, "in_reply_to_screen_name": null, "user": {"id": 4196983835, "id_str": "4196983835"}, "geo": null, "coordinates": null, "place": null, "contributors": null, "is_quote_status": false, "retweet_count": 5524, "favorite_count": 30458, "favorited": false, "retweeted": false, "possibly_sensitive": false, "possibly_sensitive_appealable": false, "lang": "en"}

Next, I read the 'twitter_json.txt' file and want to create a DataFrame from it with pandas.

with open('twitter_json.txt') as txt:
    data = [line.strip() for line in txt]

Here is a snapshot of the DataFrame that was created; the result doesn't look right.

print(pd.DataFrame(data))

                                                0
0  {"created_at": "Tue Aug 01 16:23:56 +0000 2017...
1  {"created_at": "Tue Aug 01 00:17:27 +0000 2017...

I expect the dataframe to have columns such as "created_at", "id", "id_str" etc. How do I do that?

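Reading the file with line.strip() leaves every line as a plain string, so pandas puts those strings into a single column; the JSON has to be parsed into dictionaries before it reaches pd.DataFrame. If you modify your workflow slightly, it'll work. I used different write/read routines to make this work. Also, I'm using my own data, so the output won't match yours.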

import json
import pandas as pd

# build a list of plain dicts first (one per tweet)
write_data = [tweet._json for tweet in tweet_data]

# write the whole list to the file as a single JSON array
with open('twitter_json.txt', 'w') as f:
    json.dump(write_data, f)

# read it back as a list of dicts
with open('twitter_json.txt') as f:
    data = json.load(f)

pd.DataFrame(data)

Output

                       created_at                   id               id_str                                          full_text  truncated display_text_range  ... retweet_count favorite_count favorited retweeted possibly_sensitive lang
0  Fri Jan 08 11:16:09 +0000 2021  1347502345517735940  1347502345517735940  🚇 La suite du voyage du futur métro d’Hanoï, f...      False           [0, 211]  ...             1              3     False     False              False   fr
1  Fri Jan 08 11:15:31 +0000 2021  1347502185920286722  1347502185920286722  🚇 The continuation of the journey of the futur...      False           [0, 211]  ...             1              5     False     False              False   en

[2 rows x 26 columns]
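
If you'd rather keep the one-JSON-object-per-line file from the question, parsing each line should also work. The following is only a minimal sketch: the two records and the temporary file path are made up for illustration and are not real tweet data. It shows two options, json.loads on each line and pd.read_json with lines=True.

```python
import json
import os
import tempfile

import pandas as pd

# Two made-up records standing in for tweets (assumption, not real data).
records = [
    {"created_at": "Tue Aug 01 16:23:56 +0000 2017", "id_str": "1", "retweet_count": 7427},
    {"created_at": "Tue Aug 01 00:17:27 +0000 2017", "id_str": "2", "retweet_count": 5524},
]

# Write one JSON object per line, the same layout as in the question.
path = os.path.join(tempfile.mkdtemp(), "twitter_json.txt")
with open(path, "w") as txt:
    for record in records:
        txt.write(json.dumps(record) + "\n")

# Option 1: parse each non-empty line into a dict before building the frame.
with open(path) as txt:
    data = [json.loads(line) for line in txt if line.strip()]
df = pd.DataFrame(data)

# Option 2: let pandas read the line-delimited file directly.
df_lines = pd.read_json(path, lines=True)

print(df.columns.tolist())
print(df_lines.columns.tolist())
```

With option 2, pandas may infer column types on its own (for example for ID or date-like columns), so it's worth checking df_lines.dtypes before relying on it.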
