
Twitter Scraper Rate Limit

I am trying to scrape the account information (Username, Website, Last Tweet Date) of every account that a certain account follows. For example, https://www.twitter.com/verified/following. As you can see, it follows 365.7K accounts.

I scraped the usernames, and now I have to visit each profile link and scrape that data. The code works fine and gets all the information needed, but after a certain number of page visits Twitter says I exceeded the rate limit, and it stops showing any information about the accounts I visit.

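from time import sleep

import pandas as pd
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# `driver` is assumed to be a Selenium WebDriver instance created elsewhere in the script.

def get_user_info(user):
    """Gets User Info - Username, Website, Last Tweet Date"""
    driver.get(user[0])
    sleep(1)
    username = '@' + user[0].split('/')[-1]
    attempt = 0
    # Retry until both fields are found or more than one lookup has failed.
    while True:
        try:
            website = driver.find_element(By.XPATH, "//div[@data-testid='UserProfileHeader_Items']/a").get_attribute('href')
        except NoSuchElementException:
            website = 'No Website'
            attempt += 1
            sleep(1)
        try:
            # The first <time> element on the profile page is taken as the last tweet date.
            last_tweet_date = driver.find_element(By.XPATH, "//time").get_attribute('datetime')
        except NoSuchElementException:
            last_tweet_date = 'No Tweets'
            attempt += 1
            sleep(1)
        if website != 'No Website' and last_tweet_date != 'No Tweets':
            break
        if attempt > 1:
            break

    info = (username, website, last_tweet_date)
    return info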

def user_info():
    info_list = []
    users_df = pd.read_csv('UserLinks.csv')
    user_list = users_df.values.tolist()
    for user in user_list:
        info = get_user_info(user)
        info_list.append(info)

    info_df = pd.DataFrame(info_list, columns=['Username', 'Website', 'Last Tweet Date'])
    info_df.to_csv('List2.csv', index=False)

What do you suggest?

Here's my answer to a similar question on rate limits:

How Rate Limit Works in Twitter

Essentially, every API has a rate limit that renews within a certain timeframe, e.g., 15 minutes. So you need to watch the rate limit headers or keep count yourself. When you reach the rate limit, pause your application and start again in the next rate limit window. Some APIs have a count parameter, and you'll want to set it to the maximum to get the most results per request. Also, Application auth typically gets more requests than User auth, if it's available for a given API call.
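
As a rough illustration of "watch the headers or keep count yourself", here is a minimal sketch. The helper names `seconds_until_reset` and `WindowCounter` are made up for this example. The header names `x-rate-limit-remaining` and `x-rate-limit-reset` are the ones the Twitter API returns, read here from a plain dict with lowercase keys (an assumption about how you collect them). The budget of requests per window is also an assumption that you would have to tune, especially when scraping pages through Selenium, where no such headers are exposed.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to pause, based on Twitter-style rate limit headers.

    Returns 0.0 while requests remain in the current window.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("x-rate-limit-remaining", 1))
    if remaining > 0:
        return 0.0
    # x-rate-limit-reset is a Unix timestamp (seconds) for the next window.
    reset_at = float(headers.get("x-rate-limit-reset", now))
    return max(0.0, reset_at - now)


class WindowCounter:
    """Keeps count yourself when no rate limit headers are available."""

    def __init__(self, max_requests, window_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests      # assumed budget per window
        self.window_seconds = window_seconds  # e.g. 15 * 60
        self._clock = clock
        self._sleep = sleep
        self._window_start = clock()
        self._count = 0

    def wait_for_slot(self):
        """Block until a request may be sent, then record it."""
        now = self._clock()
        if now - self._window_start >= self.window_seconds:
            # A new window has begun: reset the counter.
            self._window_start = now
            self._count = 0
        if self._count >= self.max_requests:
            # Budget spent: pause until the current window ends.
            self._sleep(self.window_seconds - (now - self._window_start))
            self._window_start = self._clock()
            self._count = 0
        self._count += 1
```

In the loop from the question, one could create `limiter = WindowCounter(max_requests=100, window_seconds=15 * 60)` once and call `limiter.wait_for_slot()` before each `get_user_info(user)`. The value 100 is only a placeholder, not a documented limit.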
