
aiohttp: rate limiting requests-per-second by domain

I am writing a web crawler that runs parallel fetches for many different domains. I want to limit the number of requests-per-second that are made to each individual domain, but I do not care about the total number of connections that are open, or the total requests-per-second made across all domains. I want to maximize the number of open connections and requests-per-second overall, while limiting the number of requests-per-second made to individual domains.

All of the currently existing examples I can find either (1) limit the number of open connections or (2) limit the total number of requests-per-second made in the fetch loop. Examples include:

Neither of them does what I am asking for, which is to limit requests-per-second on a per-domain basis. The first question only answers how to limit requests-per-second overall. The second one doesn't even have answers to the actual question (the OP asks about requests per second and the answers all talk about limiting the number of connections).

Here is the code that I tried, using a simple rate limiter I made for a synchronous version, which doesn't work when the DomainTimer code is run in an async event loop:

from collections import defaultdict
from datetime import datetime, timedelta
import asyncio
import async_timeout
import aiohttp
from urllib.parse import urlparse
from queue import Queue, Empty

from HTMLProcessing import processHTML
import URLFilters

SEED_URLS = ['http://www.bbc.co.uk', 'http://www.news.google.com']
url_queue = Queue()
for u in SEED_URLS:
    url_queue.put(u)

# number of pages to download per run of crawlConcurrent()
BATCH_SIZE = 100
DELAY = timedelta(seconds = 1.0) # delay between requests from single domain, in seconds

HTTP_HEADERS = {'Referer': 'http://www.google.com', 
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'}


class DomainTimer():
    def __init__(self):
        self.timer = None

    def resetTimer(self):
        self.timer = datetime.now()

    def delayExceeded(self, delay):
        if not self.timer: #We haven't fetched this before
            return True
        if (datetime.now() - self.timer) >= delay:
            return True
        else:
            return False


crawl_history = defaultdict(dict) # given a URL, when is last time crawled?
domain_timers = defaultdict(DomainTimer)

async def fetch(session, url):
    domain = urlparse(url).netloc
    print('here fetching ' + url + "\n")
    dt = domain_timers[domain]

    if dt.delayExceeded(DELAY) or not dt:
        with async_timeout.timeout(10):
            try:
                dt.resetTimer() # reset domain timer
                async with session.get(url, headers=HTTP_HEADERS) as response:
                    if response.status == 200:
                        crawl_history[url] = datetime.now()
                        html = await response.text()
                        return {'url': url, 'html': html}
                    else:
                        # log HTTP response, put into crawl_history so
                        # we don't attempt to fetch again
                        print(url + " failed with response: " + str(response.status) + "\n")
                        return {'url': url, 'http_status': response.status}

            except aiohttp.ClientConnectionError as e:
                print("Connection failed " + str(e))

            except aiohttp.ClientPayloadError as e: 
                print("Recieved bad data from server @ " + url + "\n")

    else: # Delay hasn't passed yet: skip for now & put @ end of q
        url_queue.put(url)
        return None


async def fetch_all(urls):
    """Launch requests for all web pages."""
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            task = asyncio.ensure_future(fetch(session, url))
            tasks.append(task) # create list of tasks
        return await asyncio.gather(*tasks) # gather task responses


def batch_crawl():
    """Launch requests for all web pages."""
    start_time = datetime.now()

    # Here we build the list of URLs to crawl for this batch
    urls = []
    for i in range(BATCH_SIZE):
        try:
            next_url = url_queue.get_nowait() # get next URL from queue
            urls.append(next_url)
        except Empty:
            print("Processed all items in URL queue.\n")
            break

    loop = asyncio.get_event_loop()
    asyncio.set_event_loop(loop)  
    pages = loop.run_until_complete(fetch_all(urls))
    crawl_time = (datetime.now() - start_time).seconds
    print("Crawl completed. Fetched " + str(len(pages)) + " pages in " + str(crawl_time) + " seconds.\n")  
    return pages


def parse_html(pages):
    """ Parse the HTML for each page downloaded in this batch"""
    start_time = datetime.now()
    results = {}

    for p in pages:
        if not p or not p.get('html'):
            print("Received empty page")
            continue
        else:
            url, html = p['url'], p['html']
            results[url] = processHTML(html)

    processing_time = (datetime.now() - start_time).seconds
    print("HTML processing finished. Processed " + str(len(results)) + " pages in " + str(processing_time) + " seconds.\n")  
    return results


def extract_new_links(results):
    """Extract links from """
    # later we could track where links were from here, anchor text, etc, 
    # and weight queue priority  based on that
    links = []
    for k in results.keys():
        new_urls = [l['href'] for l in results[k]['links']]
        for u in new_urls:
            if u not in crawl_history.keys():
                links.append(u)
    return links

def filterURLs(urls):
    urls = URLFilters.filterDuplicates(urls)
    urls = URLFilters.filterBlacklistedDomains(urls)
    return urls

def run_batch():
    pages = batch_crawl()
    results = parse_html(pages)
    links = extract_new_links(results)
    for l in filterURLs(links):
        url_queue.put(l)

    return results

There are no errors or exceptions thrown, and the rate-limiting code works fine for synchronous fetches, but the DomainTimer has no apparent effect when run in the async loop. The delay of one request-per-second per domain is not upheld...

How would I modify this synchronous rate limiting code to work within the async event loop? Thanks!

It's hard to debug your code since it contains a lot of unrelated stuff; it's easier to show the idea with a new, simple example.

Main idea:

  • write your own Semaphore-like class using __aenter__ and __aexit__ that accepts a url (domain)
  • use a domain-specific Lock to prevent multiple concurrent requests to the same domain
  • sleep before allowing the next request, based on the domain's last request time and its RPS limit
  • track the time of the last request for each domain

Code:

import asyncio
import aiohttp
from urllib.parse import urlparse
from collections import defaultdict


class Limiter:
    # domain -> req/sec:
    _limits = {
        'httpbin.org': 4,
        'eu.httpbin.org': 1,
    }

    # domain -> its lock:
    _locks = defaultdict(lambda: asyncio.Lock())

    # domain -> time of its last request:
    _times = defaultdict(lambda: 0)

    def __init__(self, url):
        self._host = urlparse(url).hostname

    async def __aenter__(self):
        # acquire the per-domain lock explicitly (awaiting a Lock directly no longer works on current Python)
        await self._lock.acquire()

        to_wait = self._to_wait_before_request()
        print(f'Wait {to_wait} sec before next request to {self._host}')
        await asyncio.sleep(to_wait)

    async def __aexit__(self, *args):        
        print(f'Request to {self._host} just finished')

        self._update_request_time()
        self._lock.release()

    @property
    def _lock(self):
        """Lock that prevents multiple requests to same host."""
        return self._locks[self._host]

    def _to_wait_before_request(self):
        """What time we need to wait before request to host."""
        request_time = self._times[self._host]
        request_delay = 1 / self._limits[self._host]
        now = asyncio.get_event_loop().time()
        to_wait = request_time + request_delay - now
        to_wait = max(0, to_wait)
        return to_wait

    def _update_request_time(self):
        now = asyncio.get_event_loop().time()
        self._times[self._host] = now


# request that uses Limiter instead of Semaphore:
async def get(url):
    async with Limiter(url):
        async with aiohttp.ClientSession() as session:  # TODO reuse session for different requests.
            async with session.get(url) as resp:
                return await resp.text()


# main:
async def main():
    coros = [
        get('http://httpbin.org/get'),
        get('http://httpbin.org/get'),
        get('http://httpbin.org/get'),
        get('http://httpbin.org/get'),
        get('http://httpbin.org/get'),
        get('http://eu.httpbin.org/get'),
        get('http://eu.httpbin.org/get'),
        get('http://eu.httpbin.org/get'),
        get('http://eu.httpbin.org/get'),
        get('http://eu.httpbin.org/get'),
    ]

    await asyncio.gather(*coros)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(main())
    finally:
        loop.run_until_complete(loop.shutdown_asyncgens())
        loop.close()
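
To tie this back to the original fetch_all pattern and the TODO above, here is a minimal sketch (assuming the Limiter class defined above) that reuses a single ClientSession for all requests while still rate limiting per domain:

import asyncio
import aiohttp


# Sketch only: share one ClientSession across all requests; the per-domain
# Limiter still serializes and delays requests to the same host.
async def fetch(session, url):
    async with Limiter(url):
        async with session.get(url) as resp:
            return {'url': url, 'html': await resp.text()}


async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(session, url)) for url in urls]
        return await asyncio.gather(*tasks)

Timeouts and exception handling (e.g. aiohttp.ClientConnectionError) are omitted here for brevity; they would wrap session.get just as in the original fetch.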

I developed a library named octopus-api ( https://pypi.org/project/octopus-api/ ) that lets you rate limit and set the number of connections to the endpoint, using aiohttp under the hood. Its goal is to simplify all the aiohttp setup needed.

Here is an example of how to use it, where get_ethereum is the user-defined request function. It could just as well be a web crawler request function or whatever fits:

from octopus_api import TentacleSession, OctopusApi
from typing import Dict, List

if __name__ == '__main__':
    async def get_ethereum(session: TentacleSession, request: Dict):
        async with session.get(url=request["url"], params=request["params"]) as response:
            body = await response.json()
            return body

    client = OctopusApi(rate=50, resolution="sec", connections=6)
    result: List = client.execute(requests_list=[{
        "url": "https://api.pro.coinbase.com/products/ETH-EUR/candles?granularity=900&start=2021-12-04T00:00:00Z&end=2021-12-04T00:00:00Z",
        "params": {}}] * 1000, func=get_ethereum)
    print(result)

TentacleSession works the same way as aiohttp.ClientSession when it comes to writing POST, GET, PUT and PATCH requests.
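
For example, a POST would presumably look like the sketch below; this is an assumption based on that claim rather than the library's documented example, and post_example and request["payload"] are hypothetical names:

# Assumed sketch: TentacleSession.post is taken to mirror aiohttp.ClientSession.post
async def post_example(session: TentacleSession, request: Dict):
    async with session.post(url=request["url"], json=request["payload"]) as response:
        return await response.json()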

Let me know if it helps with your rate-limiting and connection issue for crawling.
