Using tweepy and tesseract to extract an image from a tweet and get its text

Question

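I'm trying to add OCR to my Twitter monitor using tesseract. My question is: how do I get the image from a user's tweet and run OCR on it right away? I'm monitoring the latest tweets of certain Twitter accounts. If a new tweet comes in and contains a URL, I open it in the browser. Now I also want to check whether the tweet contains an image and print its content to the console. My code looks like this: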

import tweepy
import re
import webbrowser
import time
import urllib.request
from datetime import datetime
# a bunch of access keys
keys = [(example_keys)]

# which key is in use right now
key_index = 0
test = 0
url_store = ''



# Function to extract url from newest tweet 
def get_tweets(username, tweet_mode='extended'):
        # Authorization to consumer key and consumer secret 
        auth = tweepy.OAuthHandler(keys[key_index][0], keys[key_index][1]) 

        # Access to user's access key and access secret 
        auth.set_access_token(keys[key_index][2], keys[key_index][3]) 

        # Calling api 
        api = tweepy.API(auth) 

        # try to get latest tweet until rate limit is reached
        try:
            # Get newest tweet from profile
            tweets = api.user_timeline(screen_name=username, count=1)
            tweet = [tweet.text for tweet in tweets][0]
            print(tweet)



            global url_store
            # regex through tweet for url
            url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(tweet))

            # check if url was found and isn't the same as the url from the last tweet
            if (url!=[] and url[0]!=url_store):
                # store url in variable
                url_store=url[0]
                # open the url in webbrowser
                webbrowser.open(url[0])

                # save the html dom to a text file
                urllib.request.urlretrieve(url[0], "test.txt")

        # when rate limit is reached
        except tweepy.TweepError:
            # select the next key from array
            changeKeys() 

        # right now function always returns false
        return False


def changeKeys():
        global key_index
        # increment key_index by 1 unless end of key array is reached -> start from the beginning
        if key_index >= len(keys) - 1:
            key_index = 0
        else:
            key_index += 1

def getIMG():
        # Not implemented yet: this is the part the question is about
        pass



# Driver code 
if __name__ == '__main__': 
    # boolean if url was found (right now its always false)
    found=False
    # loop until a url is found (currently forever, since found stays False)
    while not found:
        # get tweets from specific twitter handle
        found = get_tweets("Trump",)
        time.sleep(0.02)

Answer 1

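This is a good question. Using a regex is the wrong way to find images.

Every tweet contains "entities"; see https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object

You can use them to get the images directly from the tweet.

For example: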

tweet.entities['urls']

will give you all the URLs in the tweet.
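The same idea extends to images. Below is a minimal sketch, not a tested drop-in. It assumes the status object exposes `entities` (and, when media is attached, `extended_entities`) as plain dicts, that each media item carries `type` and `media_url_https` as described in the entities documentation linked above, and that the `pytesseract` and `Pillow` packages plus the tesseract binary are installed. The helper names `image_urls` and `ocr_image` are made up for illustration.

```python
import io
import urllib.request


def image_urls(status):
    """Return the HTTPS URLs of the photos attached to a tweepy Status-like object."""
    # Prefer extended_entities, which should list every attached photo;
    # fall back to entities. Either may be missing when there is no media.
    entities = (
        getattr(status, "extended_entities", None)
        or getattr(status, "entities", None)
        or {}
    )
    return [
        media["media_url_https"]
        for media in entities.get("media", [])
        if media.get("type") == "photo"
    ]


def ocr_image(url):
    """Download one image and return the text tesseract finds in it."""
    # Imported here so image_urls() still works without the OCR packages.
    from PIL import Image
    import pytesseract

    with urllib.request.urlopen(url, timeout=10) as response:
        image = Image.open(io.BytesIO(response.read()))
    return pytesseract.image_to_string(image)
```

Inside `get_tweets` you could then call `for img in image_urls(tweets[0]): print(ocr_image(img))` right after fetching the timeline, instead of searching the tweet text with a regex.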

Solution 1
0 2020-03-18 11:44:52
