How to get all tweets (more than 100) and associated user fields in python using twitter search API v2 and Tweepy?

How do I loop my python code for Twitter API v2 recent search?
I'm very new to Python, so I'm looking for help with this problem. My goal is to collect roughly 10,000 tweets that contain images and save them to a csv file. Since Twitter's rate limit is 450 requests per 15 minutes, ideally I'd like to automate this process. The guides I've seen only use the tweepy module, but since I don't understand it very well, I used the sample Python code provided by Twitter:
```python
import requests
import pandas as pd
import os
import json

# To set your environment variables in your terminal run the following line:
os.environ['BEARER_TOKEN'] = ''


def auth():
    return os.environ.get("BEARER_TOKEN")


def create_url():
    query = "has:images lang:en -is:retweet"
    tweet_fields = "tweet.fields=attachments,created_at,author_id"
    expansions = "expansions=attachments.media_keys"
    media_fields = "media.fields=media_key,preview_image_url,type,url"
    max_results = "max_results=100"
    url = "https://api.twitter.com/2/tweets/search/recent?query={}&{}&{}&{}&{}".format(
        query, tweet_fields, expansions, media_fields, max_results
    )
    return url


def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers


def connect_to_endpoint(url, headers):
    response = requests.request("GET", url, headers=headers)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()


def save_json(file_name, file_content):
    with open(file_name, 'w', encoding='utf-8') as write_file:
        json.dump(file_content, write_file, sort_keys=True, ensure_ascii=False, indent=4)


def main():
    bearer_token = auth()
    url = create_url()
    headers = create_headers(bearer_token)
    json_response = connect_to_endpoint(url, headers)
    # Save the data as a json file
    # save_json('collected_tweets.json', json_response)
    # Save tweets as csv
    # df = pd.json_normalize(data=json_response)
    df1 = pd.DataFrame(json_response['data'])
    df1.to_csv('tweets_data.csv', mode="a")
    df2 = pd.DataFrame(json_response['includes'])
    df2.to_csv('tweets_includes_media.csv', mode="a")
    print(json.dumps(json_response['meta'], sort_keys=True, indent=4))


if __name__ == "__main__":
    main()
```
How should I change this code so that it loops within Twitter's v2 rate limits, or would it be better to use tweepy?

As a side note, I do realize my code has problems with how it saves to csv, but this is the best I can do for now.
There are a few things to keep in mind here.

First, your query ("has:images lang:en -is:retweet") will just collect these tweets in real time. If you are trying to get the full record of non-retweet, English-language tweets between two periods of time, you will need to add those time points to your query and then manage the limits as you asked above. Check out start_time and end_time in the API reference docs.

Side note: to run a script in the background, write your program and then execute it from the terminal with

```shell
nohup python nameofstreamingcode.py > logfile.log 2>&1 &
```

Any normal terminal output (i.e. print lines and/or errors) will be written to a new file called logfile.log, and the & at the end of the command makes the process run in the background (so you can close your terminal and come back to it later).
To manage the rate limits and the errors Twitter can throw at you, you can add quite a bit to your connect_to_endpoint(url, headers) function. Also, you can use another function, pause_until, written for a Twitter V2 API package I am in the process of developing ( link to function code ).

```python
from datetime import datetime

import requests


def connect_to_endpoint(url, headers):
    response = requests.request("GET", url, headers=headers)

    # Twitter returns (in the header of the request object) how many
    # requests you have left. Let's use this to our advantage.
    remaining_requests = int(response.headers["x-rate-limit-remaining"])

    # If that number is one, we get the reset time
    # and wait until then, plus 15 seconds (you're welcome, Twitter).
    # The regular 429 exception is caught below as well,
    # however, we want to program defensively, where possible.
    if remaining_requests == 1:
        buffer_wait_time = 15
        resume_time = datetime.fromtimestamp(int(response.headers["x-rate-limit-reset"]) + buffer_wait_time)
        print(f"Waiting on Twitter.\n\tResume Time: {resume_time}")
        pause_until(resume_time)  # Link to this code in above answer

    # We still may get some weird errors from Twitter.
    # We only care about the time-dependent errors (i.e. errors
    # that Twitter wants us to wait for).
    # Most of these errors can be solved simply by waiting
    # a little while and pinging Twitter again - so that's what we do.
    if response.status_code != 200:
        # Too many requests error
        if response.status_code == 429:
            buffer_wait_time = 15
            resume_time = datetime.fromtimestamp(int(response.headers["x-rate-limit-reset"]) + buffer_wait_time)
            print(f"Waiting on Twitter.\n\tResume Time: {resume_time}")
            pause_until(resume_time)  # Link to this code in above answer
        # Twitter internal server error
        elif response.status_code == 500:
            # Twitter needs a break, so we wait 30 seconds
            resume_time = datetime.now().timestamp() + 30
            print(f"Waiting on Twitter.\n\tResume Time: {resume_time}")
            pause_until(resume_time)  # Link to this code in above answer
        # Twitter service unavailable error
        elif response.status_code == 503:
            # Twitter needs a break, so we wait 30 seconds
            resume_time = datetime.now().timestamp() + 30
            print(f"Waiting on Twitter.\n\tResume Time: {resume_time}")
            pause_until(resume_time)  # Link to this code in above answer
        else:
            # If we get this far, we've done something wrong and should exit
            raise Exception(
                "Request returned an error: {} {}".format(
                    response.status_code, response.text
                )
            )

    # Each time we get a 200 response, let's exit the function and return the response.json
    if response.ok:
        return response.json()
```
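The pause_until helper itself is only linked, not shown. A minimal sketch of what it could look like (an assumption on my part — the packaged version may differ; this one accepts either a datetime or a raw POSIX timestamp):

```python
from datetime import datetime
import time


def pause_until(resume_time):
    """Sleep until `resume_time` is reached.

    Sketch of the linked helper: accepts a datetime or a POSIX timestamp.
    """
    # Normalize to a POSIX timestamp.
    end = resume_time.timestamp() if isinstance(resume_time, datetime) else float(resume_time)
    while True:
        remaining = end - time.time()
        if remaining <= 0:
            break
        # Sleep in short intervals so the clock is re-checked regularly,
        # which keeps the wait accurate across system sleeps/suspends.
        time.sleep(min(remaining, 1.0))
```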
Since the full query results will be much larger than the 100 tweets you request with each call, you need to keep track of your location within the larger query. This is done via the next_token.

Getting the next_token is actually quite easy. Simply grab it from the meta field in the response. To be clear, you can use the above function like so...
```python
# Get response
response = connect_to_endpoint(url, headers)
# Get next_token
next_token = response["meta"]["next_token"]
```
This token then needs to be passed in the query details, which are contained in the url you create with your create_url() function. That means you'll also need to update create_url() to something like the following...
```python
def create_url(pagination_token=None):
    query = "has:images lang:en -is:retweet"
    tweet_fields = "tweet.fields=attachments,created_at,author_id"
    expansions = "expansions=attachments.media_keys"
    media_fields = "media.fields=media_key,preview_image_url,type,url"
    max_results = "max_results=100"
    if pagination_token is None:
        url = "https://api.twitter.com/2/tweets/search/recent?query={}&{}&{}&{}&{}".format(
            query, tweet_fields, expansions, media_fields, max_results
        )
    else:
        # The token has to be passed as the next_token query parameter.
        url = "https://api.twitter.com/2/tweets/search/recent?query={}&{}&{}&{}&{}&next_token={}".format(
            query, tweet_fields, expansions, media_fields, max_results, pagination_token
        )
    return url
```
After altering the above functions, your code should flow in the following manner:

1. Make a request.
2. Get the next_token from response["meta"]["next_token"].
3. Pass that next_token to create_url() and make the next request.
4. Repeat until you have collected enough tweets or the response no longer contains a next_token (meaning you've reached the end of the results).
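That flow can be sketched as a small driver function. This is a hypothetical helper, not part of the answer's package: `fetch(pagination_token)` stands in for calling create_url() plus connect_to_endpoint() and returning the decoded JSON.

```python
def collect_tweets(fetch, max_tweets=10000):
    """Page through recent-search results until `max_tweets` have been
    collected or Twitter stops returning a next_token."""
    all_tweets = []
    next_token = None
    while len(all_tweets) < max_tweets:
        response = fetch(next_token)
        all_tweets.extend(response.get("data", []))
        # meta carries the pagination cursor; it is absent on the last page.
        next_token = response.get("meta", {}).get("next_token")
        if next_token is None:
            break
    return all_tweets
```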
One last note: I would not try to use pandas dataframes to write your file. I would create an empty list, append the results of each new query to that list, and then write the final list of dictionary objects to a json file (see this question for details). I've learned the hard way that raw tweets and pandas dataframes don't play nicely together. It's much better to get used to how json objects and dictionaries work.
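A minimal sketch of that append-then-dump pattern (the filename is illustrative):

```python
import json

collected = []  # one dict per tweet, accumulated across all queries

# Inside your pagination loop you would do:
# collected.extend(json_response["data"])

# After the loop, write everything out in one go.
with open("collected_tweets.json", "w", encoding="utf-8") as f:
    json.dump(collected, f, ensure_ascii=False, indent=4)
```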
Try using a scheduler:

```python
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(delay=16 * 60, priority=1, action=connect_to_endpoint, argument=(url, headers))
scheduler.run()  # blocks until the scheduled event has run
```

delay is the amount of time between two events.

action is the method to execute (every 16 minutes, in this case).

Consider the exact timing and the exact method to repeat.
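Note that sched executes each entered event only once; to actually repeat a call every 16 minutes, the action has to re-enter itself. A small hypothetical helper (schedule_repeating is not part of the standard library) sketches this:

```python
import sched
import time


def schedule_repeating(scheduler, action, delay, repeats):
    """Run `action` `repeats` times, `delay` seconds apart,
    by having each run schedule the next one."""
    def wrapper(remaining):
        action()
        if remaining > 1:
            scheduler.enter(delay, 1, wrapper, argument=(remaining - 1,))
    scheduler.enter(delay, 1, wrapper, argument=(repeats,))
```

For the use case above you would pass delay=16 * 60 and an action that requests and saves one page of results, then call scheduler.run() to start.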
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.