Mining 7 days' worth of Tweets from the Twitter API with a certain hashtag using Python Tweepy
I am mining 7 days' worth of Tweets from the Twitter API using Python, Tweepy, Django, Celery, and Django REST framework.
I send a request every minute using Celery beat and store the collected data in a PostgreSQL database using the Django ORM.
To ensure that the API doesn't keep sending the same 100 tweets with each call, I check the database for min(tweet.id) and set that as the max_id parameter before each new request.
I ran into a problem: once I have 7 days' worth of tweets, how do I reset this max_id?
from django.db import models


class Tweet(models.Model):
    tweet_id = models.CharField(
        max_length=200,
        unique=True,
        primary_key=True
    )
    tweet_date = models.DateTimeField()
    tweet_source = models.TextField()
    tweet_favorite_cnt = models.CharField(max_length=200)
    tweet_retweet_cnt = models.CharField(max_length=200)
    tweet_text = models.TextField()

    def __str__(self):
        return self.tweet_id + ' | ' + str(self.tweet_date)
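from datetime import datetime, timedelta

import tweepy
from celery import shared_task
from django.db import IntegrityError

# Tweet is the model shown above; import it from your app's models module.

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
# Instantiate an instance of the API class from the tweepy library.
# Note: this is the Tweepy 3.x API. In Tweepy 4, wait_on_rate_limit_notify
# was removed and API.search was renamed to API.search_tweets.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)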


@shared_task(name='cleanup')
def cleanup():
    """
    Check database for records older than 7 days.
    Delete them if they exist.
    """
    Tweet.objects.filter(tweet_date__lte=datetime.now() - timedelta(days=7)).delete()


@shared_task(name='get_tweets')
def get_tweets():
    """Get some tweets from the twitter api and store them to the db."""
    # Subtasks
    chain = cleanup.s()
    chain()
    # Check for the minimum tweet_id and set it as max_id.
    # This ensures the API call doesn't keep getting the same tweets.
    max_id = min([tweet.tweet_id for tweet in Tweet.objects.all()])
    # Make the call to the Twitter Search API.
    tweets = api.search(
        q='#python',
        max_id=max_id,
        count=100
    )
    # Store the collected data into lists.
    tweets_date = [tweet.created_at for tweet in tweets]
    tweets_id = [tweet.id for tweet in tweets]
    tweets_source = [tweet.source for tweet in tweets]
    tweets_favorite_cnt = [tweet.favorite_count for tweet in tweets]
    tweets_retweet_cnt = [tweet.retweet_count for tweet in tweets]
    tweets_text = [tweet.text for tweet in tweets]
    # Iterate over these lists and save the items as fields for new records in the database.
    for i, j, k, l, m, n in zip(
        tweets_id,
        tweets_date,
        tweets_source,
        tweets_favorite_cnt,
        tweets_retweet_cnt,
        tweets_text
    ):
        try:
            Tweet.objects.create(
                tweet_id=i,
                tweet_date=j,
                tweet_source=k,
                tweet_favorite_cnt=l,
                tweet_retweet_cnt=m,
                tweet_text=n,
            )
        except IntegrityError:
            pass
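
A note on the cursor in the task above: tweet_id is stored in a CharField, so min() compares the IDs as strings rather than as numbers, and min() of an empty list raises ValueError on the very first run. The search endpoint's max_id is also inclusive, so passing the smallest stored ID returns that tweet again. Here is a minimal sketch of a safer cursor, assuming the IDs are decimal strings; next_max_id is a hypothetical helper name, not part of Tweepy or Django.

```python
def next_max_id(stored_ids):
    """Return the max_id for the next search call, or None if nothing is stored.

    stored_ids: iterable of tweet IDs as decimal strings (as kept in a CharField).
    """
    numeric_ids = [int(tweet_id) for tweet_id in stored_ids]
    if not numeric_ids:
        # Empty table: there is no cursor yet, so the caller should omit max_id.
        return None
    # max_id is inclusive, so step one below the oldest tweet already stored.
    return min(numeric_ids) - 1


# String comparison picks the wrong "minimum" when IDs differ in length.
ids = ["99", "1000", "250"]
print(min(ids))          # → 1000 (lexicographic minimum)
print(next_max_id(ids))  # → 98 (numeric minimum minus one)
```

In the task, the IDs could be read without loading whole model instances, for example with Tweet.objects.values_list('tweet_id', flat=True), and max_id would be passed to api.search only when the helper returns a value.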
Try this:
# Check for the minimum tweet_id and set it as max_id.
# This ensures the API call doesn't keep getting the same tweets.
# get_seven_day_partition is a placeholder: define it so that it returns
# the datetime marking the start of the current seven-day section.
date_partition = get_seven_day_partition()
# Since you're cutting off every seven days, you should know how
# to separate your weeks into seven-day sections.
max_id = min([tweet.tweet_id for tweet in Tweet.objects.all()
              if tweet.tweet_date > date_partition])
You didn't give enough information about how you're pulling these tweets, how you know to stop at a certain date, or how the program is executed, so it's hard to advise on a proper way of keeping track of the date.
What I can tell you is this: set date_partition according to your use case, and this addition to the max_id assignment will compute max_id from only the tweets newer than that partition date.
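
One way to make the reset concrete, offered as a sketch under assumptions rather than as the definitive approach: treat the backfill as finished when the previous search returned nothing or when the oldest stored tweet is already close to the seven-day boundary, then drop max_id and pass since_id (the newest stored ID) so that later calls fetch only newer tweets. The helper name search_cursor, its arguments, and the one-hour margin are all hypothetical; max_id and since_id are existing parameters of the search call.

```python
from datetime import datetime, timedelta


def search_cursor(rows, last_batch_size, now,
                  window=timedelta(days=7), margin=timedelta(hours=1)):
    """Choose the paging arguments for the next search call.

    rows: list of (tweet_id, tweet_date) pairs already stored; IDs as int or str.
    last_batch_size: number of tweets the previous call returned (None on first run).
    now: current time, naive or aware to match the stored tweet_date values.
    Returns {} (first run), {'max_id': ...} (keep paging back),
    or {'since_id': ...} (backfill finished, fetch only newer tweets).
    """
    if not rows:
        return {}  # nothing stored yet: start from the newest tweets
    ids = [int(tweet_id) for tweet_id, _ in rows]
    oldest_date = min(tweet_date for _, tweet_date in rows)
    # Assumption: being within `margin` of the boundary counts as "reached".
    boundary_reached = oldest_date <= now - window + margin
    if last_batch_size == 0 or boundary_reached:
        return {"since_id": max(ids)}
    # max_id is inclusive, so step below the oldest stored ID.
    return {"max_id": min(ids) - 1}


now = datetime(2024, 1, 8, 12, 0)
rows = [("1000", datetime(2024, 1, 8, 11, 0)), ("900", datetime(2024, 1, 7, 11, 0))]
print(search_cursor(rows, 100, now))  # → {'max_id': 899}
print(search_cursor(rows, 0, now))    # → {'since_id': 1000}
```

The returned dict could be unpacked into the existing call, for example api.search(q='#python', count=100, **cursor). The caller would need to remember the size of the previous batch between runs, and a single since_id call returns at most one page, so a busy hashtag may need more than one request per run.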