
Using tweepy and tesseract to extract an image from a tweet and get its text

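I'm trying to add OCR with tesseract to my Twitter monitor. My question is: how do I get an image from a user's tweet and run OCR on it right away? I'm monitoring the latest tweets from certain Twitter accounts. When a new tweet comes in and contains a URL, I open it in the browser. Now I also want to check whether the tweet contains an image and print its text content to the console. My code looks like this: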

import tweepy
import re
import webbrowser
import time
import urllib.request
from datetime import datetime
# a bunch of access keys
keys = [(example_keys)]

# which key is in use right now
key_index = 0
test = 0
url_store = ''



# Function to extract url from newest tweet 
def get_tweets(username, tweet_mode='extended'):
        # Authorization to consumer key and consumer secret 
        auth = tweepy.OAuthHandler(keys[key_index][0], keys[key_index][1]) 

        # Access to user's access key and access secret 
        auth.set_access_token(keys[key_index][2], keys[key_index][3]) 

        # Calling api 
        api = tweepy.API(auth) 

        # try to get latest tweet until rate limit is reached
        try:
            # Get newest tweet from profile
            tweets = api.user_timeline(screen_name=username, count=1)
            tweet = [tweet.text for tweet in tweets][0]
            print(tweet)



            global url_store
            # regex through tweet for url
            url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(tweet))

            # check if url was found and isn't the same as the url from the last tweet
            if (url!=[] and url[0]!=url_store):
                # store url in variable
                url_store=url[0]
                # open the url in webbrowser
                webbrowser.open(url[0])

                # save the html dom to a text file
                urllib.request.urlretrieve(url[0], "test.txt")

        # when rate limit is reached
        except tweepy.TweepError:
            # select the next key from array
            changeKeys() 

        # right now function always returns false
        return False


def changeKeys():
        global key_index
        # increment key_index by 1 unless end of key array is reached -> start from the beginning
        if key_index >= len(keys) - 1:
            key_index = 0
        else:
            key_index += 1

def getIMG():
        # placeholder: image extraction and OCR are not implemented yet
        pass


# Driver code 
if __name__ == '__main__': 
    # whether a url was found (right now it's always False)
    found=False
    # loop forever, since get_tweets() currently always returns False
    while not found:
        # get tweets from specific twitter handle
        found = get_tweets("Trump",)
        time.sleep(0.02)

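This is a good question. Using a RegEx is the wrong way to find images in a tweet.

Every tweet contains "entities" - see https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object

You can use them to get the images directly from the tweet.

For example: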

tweet.entities['urls']

will give you all the URLs in the tweet.
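To make the entities approach concrete, here is a minimal sketch of how the URLs and attached photos might be read from a tweet and passed to Tesseract. It assumes the v1.1 tweet format, in which tweepy exposes `entities` (and, when media is attached, `extended_entities`) as plain dicts. The helper names `extract_urls`, `extract_photo_urls`, and `ocr_image_url` are hypothetical. The OCR step additionally assumes that Pillow, pytesseract, and the Tesseract binary are installed.

```python
from io import BytesIO
from urllib.request import urlopen


def extract_urls(entities):
    """Return the expanded form of every URL listed in a tweet's entities."""
    return [u.get("expanded_url") or u["url"] for u in entities.get("urls", [])]


def extract_photo_urls(entities, extended_entities=None):
    """Return direct image URLs for the photos attached to a tweet.

    extended_entities, when present, lists every attached photo, so it is
    preferred over entities.
    """
    source = extended_entities or entities
    return [
        media["media_url_https"]
        for media in source.get("media", [])
        if media.get("type") == "photo"
    ]


def ocr_image_url(image_url):
    """Download an image and return the text Tesseract recognizes in it."""
    # Imported here so the helpers above work without the OCR dependencies.
    from PIL import Image
    import pytesseract

    with urlopen(image_url) as response:
        image = Image.open(BytesIO(response.read()))
    return pytesseract.image_to_string(image)
```

Inside `get_tweets`, this could be used roughly as `for photo in extract_photo_urls(tweets[0].entities, getattr(tweets[0], "extended_entities", None)): print(ocr_image_url(photo))`. Reading `expanded_url` from the entities would also replace the regex, because the tweet text itself only contains shortened t.co links.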

