Using tweepy and tesseract to extract an image from a tweet and get its text
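I'm trying to implement OCR on my Twitter monitor using tesseract. My question is: how do I get the image from a user's tweet and run OCR on it right away? I'm monitoring certain Twitter accounts for their latest tweets; if a new tweet comes in and contains a URL, I open it in the browser. Now I'd also like to check whether the tweet contains an image and print its content to the console. My code looks like this: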
import tweepy
import re
import webbrowser
import time
import urllib.request
from datetime import datetime

# a bunch of access keys
keys = [(example_keys)]
# which key is in use right now
key_index = 0
test = 0
url_store = ''


# Function to extract url from newest tweet
def get_tweets(username, tweet_mode='extended'):
    global url_store
    # Authorization to consumer key and consumer secret
    auth = tweepy.OAuthHandler(keys[key_index][0], keys[key_index][1])
    # Access to user's access key and access secret
    auth.set_access_token(keys[key_index][2], keys[key_index][3])
    # Calling api
    api = tweepy.API(auth)
    # try to get latest tweet until rate limit is reached
    try:
        # Get newest tweet from profile
        tweets = api.user_timeline(screen_name=username, count=1)
        tweet = [tweet.text for tweet in tweets][0]
        print(tweet)
        # regex through tweet for url
        url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(tweet))
        # check if url was found and isn't the same as the url from the last tweet
        if url != [] and url[0] != url_store:
            # store url in variable
            url_store = url[0]
            # open the url in webbrowser
            webbrowser.open(url[0])
            # save the html dom to a text file
            urllib.request.urlretrieve(url[0], "test.txt")
    # when rate limit is reached
    # (tweepy 3.x name; tweepy 4.0 renamed it to TweepyException)
    except tweepy.TweepError:
        # select the next key from array
        changeKeys()
    # right now function always returns false
    return False


def changeKeys():
    global key_index
    # increment key_index by 1 unless end of key array is reached -> start from the beginning
    if key_index >= len(keys) - 1:
        key_index = 0
    else:
        key_index += 1


def getIMG():
    # not implemented yet: this is the part the question is about
    pass


# Driver code
if __name__ == '__main__':
    # boolean if url was found (right now it's always false)
    found = False
    # never-ending loop
    while not found:
        # get tweets from specific twitter handle
        found = get_tweets("Trump")
        time.sleep(0.02)
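That's a good question. Using a regex is the wrong way to find images.
Every tweet contains "entities" - see https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object
You can use them to get the images directly from the tweet.
For example:
tweet.entities['urls']
will give you all the URLs in the tweet.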
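To connect this to the OCR part of the question, here is a minimal sketch of one way to do it, assuming the tweet's entities arrive as a plain dict with a "media" list as described in the entities documentation linked above. The helper names photo_urls and ocr_image and the sample data are made up for illustration, and the OCR step assumes the pytesseract and Pillow packages plus a local Tesseract install.

```python
import io
import urllib.request


def photo_urls(entities):
    """Return the HTTPS URLs of all photos listed in a tweet's entities dict."""
    media = entities.get("media", [])
    return [m["media_url_https"] for m in media if m.get("type") == "photo"]


def ocr_image(url):
    """Download one image and return the text Tesseract finds in it."""
    # Imported here so the rest of the sketch runs without the OCR libraries.
    from PIL import Image
    import pytesseract

    with urllib.request.urlopen(url) as response:
        image = Image.open(io.BytesIO(response.read()))
    return pytesseract.image_to_string(image)


# Hand-written sample in the shape of an entities object (not real tweet data).
sample_entities = {
    "urls": [{"expanded_url": "https://example.com/article"}],
    "media": [
        {"type": "photo", "media_url_https": "https://pbs.twimg.com/media/example.jpg"},
    ],
}

print(photo_urls(sample_entities))
```

With a real tweepy status you would probably call photo_urls(tweet.entities) and then print(ocr_image(url)) for each URL it returns. Tweets without attached media have no "media" key, which is why the helper falls back to an empty list. For tweets with several photos, the full list is reported under extended_entities rather than entities, so it may be worth checking that attribute too.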