
How do I save streaming tweets in json via tweepy?

I've been learning Python for a couple of months through online courses and would like to further my learning through a real-world mini project.

For this project, I would like to collect tweets from the Twitter streaming API and store them in JSON format. (You can choose to save just the key information, like status.text and status.id, but I've been advised that the best approach is to save all the data and do the processing afterwards.) However, with the addition of on_data() the code ceases to work. Would someone be able to assist, please? I'm also open to suggestions on the best way to store/process tweets! My end goal is to be able to track tweets based on demographic variables (e.g., country, user profile age) and sentiment towards particular brands (e.g., Apple, HTC, Samsung).

In addition, I would also like to try filtering tweets by location AND keywords. I've adapted the code from How to add a location filter to tweepy module separately. However, while it works when there are a few keywords, it stops when the number of keywords grows. I presume my code is inefficient. Is there a better way of doing it?

### code to save tweets in JSON ###
import sys
import tweepy
import json

consumer_key=" "
consumer_secret=" "
access_key = " "
access_secret = " "

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
file = open('today.txt', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)

    def on_data(self, data):
        json_data = json.loads(data)
        file.write(str(json_data))

    def on_error(self, status_code):
        print('Encountered error with status code:', status_code, file=sys.stderr)
        return True # Don't kill the stream

    def on_timeout(self):
        print('Timeout...', file=sys.stderr)
        return True # Don't kill the stream

sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(track=['twitter'])

I found a way to save the tweets to a JSON file. Happy to hear how it can be improved!

# initialize blank list to contain tweets
tweets = []
# file name that you want to open is the second argument
save_file = open('9may.json', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(CustomStreamListener, self).__init__()

        self.save_file = tweets

    def on_data(self, tweet):
        self.save_file.append(json.loads(tweet))
        print(tweet)
        save_file.write(str(tweet))
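One common improvement (a sketch of my own, not from the original post) is to write each tweet as one JSON object per line, i.e. newline-delimited JSON, so the file stays parseable line by line; the `save_tweet` helper and the file name are hypothetical:

```python
import json

def save_tweet(raw_data, path='tweets.json'):
    """Append one tweet as a single line of JSON (newline-delimited JSON)."""
    parsed = json.loads(raw_data)  # validate the incoming payload first
    with open(path, 'a') as f:
        # json.dumps keeps each line valid JSON, unlike str() on a dict
        f.write(json.dumps(parsed) + '\n')
```

Each stored line can then be read back individually with json.loads, which is harder to do with str()-formatted dicts.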

In rereading your original question, I realize that you ask a lot of smaller questions. I'll try to answer most of them here, but some may merit a separate question on SO.

  • Why does it break with the addition of on_data?

Without seeing the actual error, it's hard to say. It actually didn't work for me until I regenerated my consumer/access keys; I'd try that.

There are a few things I might do differently from your answer.

tweets is a global list. This means that if you have multiple StreamListeners (i.e., in multiple threads), every tweet collected by any stream listener will be added to this list. This is because lists in Python refer to locations in memory. If that's confusing, here's a basic example of what I mean:

>>> bar = []
>>> foo = bar
>>> foo.append(7)
>>> print(bar)
[7]

Notice that even though you only appended 7 to foo, foo and bar actually refer to the same list (and therefore changing one changes both).

If you meant to do this, it's a pretty great solution. However, if your intention was to segregate tweets from different listeners, it could be a huge headache. I personally would construct my class like this:

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(CustomStreamListener, self).__init__()

        self.list_of_tweets = []

This changes the tweets list to be only in the scope of your class. Also, I think it's appropriate to change the property name from self.save_file to self.list_of_tweets, because you also name the file that you're appending the tweets to save_file. Although this will not strictly cause an error, it's confusing to a human reader that self.save_file is a list and save_file is a file. It helps future you, and anyone else who reads your code, figure out what everything does/is. More on variable naming.

In my comment, I mentioned that you shouldn't use file as a variable name. file is a Python builtin that constructs a new object of type file. You can technically overwrite it, but it is a very bad idea to do so. For more builtins, see the Python documentation.

  • How do I filter results on multiple keywords?

All keywords are OR'd together in this type of search (source):

sapi.filter(track=['twitter', 'python', 'tweepy'])

This means that this will get tweets containing 'twitter', 'python', or 'tweepy'. If you want the intersection (AND) of all the terms, you have to post-process by checking each tweet against the list of all terms you want to search for.
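As a sketch of that post-processing (the helper name is my own, not from the original answer), you could keep a tweet only when every term appears in its text:

```python
def matches_all(text, terms):
    """Return True only when every term appears in the tweet text (the AND case)."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)
```

You would call this inside on_status/on_data and discard any tweet for which it returns False.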

  • How do I filter results based on location AND keyword?

I actually just realized that you did ask this as its own question, as I was about to suggest. A regex post-processing solution is a good way to accomplish this. You could also try filtering by both location and keyword like so:

sapi.filter(locations=[103.60998,1.25752,104.03295,1.44973], track=['twitter'])
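One caveat worth noting: to my knowledge, Twitter's streaming API treats locations and track as an OR, so passing both returns tweets matching either filter, and the regex post-processing mentioned above is still needed. A sketch of that post-filter (the brand list is purely an illustration taken from the question, and the function name is mine):

```python
import re

# Brands from the question, used here purely as an illustration
BRAND_RE = re.compile(r'\b(apple|htc|samsung)\b', re.IGNORECASE)

def keep_tweet(tweet):
    """Post-filter a location-only stream: keep tweets mentioning a tracked brand."""
    return bool(BRAND_RE.search(tweet.get('text', '') or ''))
```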
  • What is the best way to store/process tweets?

That depends on how many you'll be collecting. I'm a fan of databases, especially if you're planning to do sentiment analysis on a lot of tweets. When you collect data, you should only collect the things you will need. This means that when you save results to your database (or wherever) in your on_data method, you should extract the important parts from the JSON and not save anything else. If, for example, you want to look at brand, country, and time, only take those three things; don't save the entire JSON dump of the tweet, because it'll just take up unnecessary space.
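A minimal sketch of that extraction, assuming the standard Twitter v1.1 tweet payload (the field names like place.country come from that format; the function name is my own):

```python
def extract_fields(tweet):
    """Keep only the parts needed for the analysis: text, country, and time."""
    place = tweet.get('place') or {}  # place is None for most tweets
    return {
        'text': tweet.get('text'),
        'country': place.get('country'),
        'created_at': tweet.get('created_at'),
    }
```

This dict is what you would insert into the database instead of the full JSON dump.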

I just insert the raw JSON into the database. It seems a bit ugly and hacky, but it does work. A notable problem is that the creation dates of the tweets are stored as strings. How do I compare dates from Twitter data stored in MongoDB via PyMongo? provides a way to fix that (I inserted a comment in the code to indicate where one would perform that task).

# ...

client = pymongo.MongoClient()
db = client.twitter_db
twitter_collection = db.tweets

# ...

class CustomStreamListener(tweepy.StreamListener):
    # ...
    def on_status(self, status):
        try:
            twitter_json = status._json
            # TODO: Transform created_at to Date objects before insertion
            tweet_id = twitter_collection.insert(twitter_json)
        except Exception:
            # Catch any unicode errors while printing to console
            # and just ignore them to avoid breaking the application.
            pass
    # ...

stream = tweepy.Stream(auth, CustomStreamListener(), timeout=None, compression=True)
stream.sample()
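The created_at conversion flagged by the TODO above could be sketched like this; the format string matches the created_at layout used by the Twitter v1.1 API (e.g. "Wed Aug 27 13:08:45 +0000 2008"):

```python
from datetime import datetime

def parse_created_at(created_at):
    """Turn Twitter's created_at string into a datetime so MongoDB
    stores it as a native Date instead of a string."""
    return datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')
```

You would apply this to twitter_json['created_at'] just before the insert, which makes date-range queries in MongoDB work as expected.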
