
Parse massive JSON string from Tweepy or convert to dict/JSON format

This is my first time using Tweepy, and I am a Python novice. After setting up OAuth, I used the following code to collect tweets:

import json
import sys
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
file = open('SOTU1.txt', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print status.text

    def on_data(self, data):
        json_data = json.loads(data)
        file.write(str(json_data))

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True # Don't kill the stream

The resulting text file looks like this, and continues as one long string:

{u'contributors': None, u'truncated': False, u'text': u'Lost my cool today           
\U0001f602\U0001f63e like completely', u'in_reply_to_status_id': None, u'id': 
557709279751581696, u'favorite_count': 0, u'source': u'<a 
href="http://twitter.com/download/android" rel="nofollow">Twitter for 
Android</a>', u'retweeted': False, u'coordinates': {u'type': u'Point', 
u'coordinates': [-97.925459, 29.877993]}, u'timestamp_ms': u'1421803228687', 
u'entities': {u'user_mentions': [], u'symbols': [], u'trends': [], 
u'hashtags': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': 
u'557709279751581696', u'retweet_count': 0, u'in_reply_to_user_id': None, 
u'favorited': False, u'user': {u'follow_request_sent': None, 
u'profile_use_background_image': True, u'default_profile_image': False, u'id': 
1239731318, u'verified': False, u'profile_image_url_https': 

I have tried various solutions offered on this site, but none worked because the file contents are one string rather than a list. I also tried turning it into dictionary form by stripping the u' prefixes, but values on the right side of each pair, such as None and False, are not enclosed in quotes.

My goal is to extract the text and geocode from each tweet, and I am hoping to process the JSON file in bash using jq. As it stands, I cannot feed this data to jq, and it is hard to tell which batch of lines comes from a single tweet.

Thanks in advance!
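The file is not valid JSON because str() on a dict writes Python's own literal syntax (single quotes, None, False, and in Python 2 the u'' prefixes), whereas the json module writes JSON syntax. A minimal sketch of the difference follows; the tweet fragment is a made-up example shaped like the output above, not real data.

```python
import json

# Hypothetical fragment of a decoded tweet, shaped like the output in the question.
tweet = {"text": "Lost my cool today", "retweeted": False, "in_reply_to_status_id": None}

as_repr = str(tweet)         # Python literal syntax: single quotes, False, None
as_json = json.dumps(tweet)  # JSON syntax: double quotes, false, null

print(as_repr)
print(as_json)

# The JSON text round-trips through json.loads; the repr text does not parse as JSON.
assert json.loads(as_json) == tweet
try:
    json.loads(as_repr)
    repr_is_json = True
except ValueError:
    repr_is_json = False
print(repr_is_json)
```

This is why stripping the u' prefixes by hand is not enough: the quotes and the None/False values would also need converting.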

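Serialize each tweet with json.dump instead of str(), and write one object per line (my_file is the open output file, called file in the question):

def on_data(self, data):
    json_data = json.loads(data)
    json.dump(json_data, my_file)
    my_file.write('\n')  # one tweet per line

Then, when you want the data back, parse the file one line at a time:

with open("file.txt") as f:
    json_data = [json.loads(line) for line in f]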
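For the stated goal of pulling the text and geocode out of each tweet, here is a rough sketch. It assumes the file holds one JSON object per line; a file of objects written back to back with no newline between them would need splitting first. The helper name extract_text_and_coords and the sample lines are illustrative, not part of Tweepy. The coordinates field, when present, appears to follow GeoJSON order, i.e. [longitude, latitude].

```python
import json

def extract_text_and_coords(lines):
    """Yield (text, coordinates-or-None) for each newline-delimited JSON tweet."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        if "text" not in tweet:
            continue  # skip stream messages that are not tweets
        point = tweet.get("coordinates") or {}  # the field is null when no location is attached
        yield tweet["text"], point.get("coordinates")

# Made-up sample lines standing in for the contents of the output file.
sample = [
    '{"text": "Lost my cool today", "coordinates": {"type": "Point", "coordinates": [-97.925459, 29.877993]}}',
    '{"text": "no location here", "coordinates": null}',
    '',
]
results = list(extract_text_and_coords(sample))
for text, coords in results:
    print(text, coords)
```

To run it on a real file, pass the open file object instead of the sample list, for example extract_text_and_coords(open("file.txt")).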


 