
aiohttp: rate limiting parallel requests

APIs often have rate limits that users have to follow. As an example, let's take 50 requests/second. Sequential requests take 0.5-1 second and thus are too slow to come close to that limit. Parallel requests with aiohttp, however, exceed the rate limit.

To poll the API as fast as allowed, one needs to rate limit parallel calls.

Examples that I found so far decorate session.get, approximately like so:

session.get = rate_limited(max_calls_per_second)(session.get)

This works well for sequential calls. Trying to implement this in parallel calls does not work as intended.
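
For reference, a minimal sketch of what such a rate_limited decorator might look like (the decorator is not shown above; this body is only an assumption, and it assumes the wrapped call is awaited directly, e.g. resp = await session.get(url)):

import asyncio
import time

def rate_limited(max_calls_per_second):
    """Hypothetical decorator: delays calls so they start no more often
    than max_calls_per_second (assumed implementation)."""
    min_interval = 1.0 / max_calls_per_second

    def decorator(func):
        last_call = 0.0

        async def wrapper(*args, **kwargs):
            nonlocal last_call
            elapsed = time.monotonic() - last_call
            if elapsed < min_interval:
                await asyncio.sleep(min_interval - elapsed)
            last_call = time.monotonic()
            return await func(*args, **kwargs)

        return wrapper

    return decorator

With concurrent tasks, every waiter measures its delay against the same last_call timestamp and they all fire together afterwards, which is exactly the failure mode described below.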

Here's some code as an example:

async with aiohttp.ClientSession() as session:
    session.get = rate_limited(max_calls_per_second)(session.get)
    tasks = (asyncio.ensure_future(download_coroutine(  
          timeout, session, url)) for url in urls)
    process_responses_function(await asyncio.gather(*tasks))

The problem with this is that it will rate-limit the queueing of the tasks. The execution with gather will still happen more or less at the same time. Worst of both worlds ;-).

Yes, I found a similar question right here, aiohttp: set maximum number of requests per second, but neither reply answers the actual question of limiting the rate of requests. Also the blog post from Quentin Pradet works only on rate-limiting the queueing.

To wrap it up: How can one limit the number of requests per second for parallel aiohttp requests?

If I understand you correctly, you want to limit the number of simultaneous requests?

There is an object inside asyncio named Semaphore; it works like an asynchronous RLock.

semaphore = asyncio.Semaphore(50)
#...
async def limit_wrap(url):
    async with semaphore:
        # do what you want
#...
results = await asyncio.gather(*[limit_wrap(url) for url in urls])

Updated

Suppose I make 50 concurrent requests, and they all finish in 2 seconds. So, it doesn't touch the limitation (only 25 requests per second).

That means I should make 100 concurrent requests, and they all finish in 2 seconds too (50 requests per second). But before you actually make those requests, how could you determine how long they will take to finish?

Or, if what you care about is not requests finished per second but requests made per second, you can:

async def loop_wrap(urls):
    for url in urls:
        asyncio.ensure_future(download(url))
        await asyncio.sleep(1/50)

loop = asyncio.get_event_loop()
loop.create_task(loop_wrap(urls))
loop.run_forever()

The code above will create a Future instance every 1/50 second.
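
If both a cap on simultaneous requests and a cap on request starts per second are needed, the two ideas above can be combined. Here is a minimal sketch under that assumption (download is a placeholder for the real aiohttp call):

import asyncio

MAX_CONCURRENT = 50   # upper bound on simultaneous requests
MAX_PER_SECOND = 50   # upper bound on request starts per second

async def download(url):
    # placeholder for the real aiohttp request
    await asyncio.sleep(0.5)
    return url

async def limited_download(semaphore, url):
    async with semaphore:                           # caps how many run at once
        return await download(url)

async def main(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = []
    for url in urls:
        tasks.append(asyncio.create_task(limited_download(semaphore, url)))
        await asyncio.sleep(1 / MAX_PER_SECOND)     # spaces out task starts
    return await asyncio.gather(*tasks)

results = asyncio.run(main(range(200)))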

I approached the problem by creating a subclass of aiohttp.ClientSession() with a rate limiter based on the leaky-bucket algorithm. I use asyncio.Queue() for rate limiting instead of Semaphores. I've only overridden the _request() method. I find this approach cleaner since you only replace session = aiohttp.ClientSession() with session = ThrottledClientSession(rate_limit=15).

import asyncio
import time
from typing import Optional

import aiohttp


class ThrottledClientSession(aiohttp.ClientSession):
    """Rate-throttled client session class inherited from aiohttp.ClientSession"""
    MIN_SLEEP = 0.1

    def __init__(self, rate_limit: float = None, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.rate_limit = rate_limit
        self._fillerTask = None
        self._queue = None
        self._start_time = time.time()
        if rate_limit != None:
            if rate_limit <= 0:
                raise ValueError('rate_limit must be positive')
            self._queue = asyncio.Queue(min(2, int(rate_limit)+1))
            self._fillerTask = asyncio.create_task(self._filler(rate_limit))

     
    def _get_sleep(self) -> Optional[float]:
        if self.rate_limit is not None:
            return max(1 / self.rate_limit, self.MIN_SLEEP)
        return None
        
    async def close(self) -> None:
        """Close rate-limiter's "bucket filler" task"""
        if self._fillerTask is not None:
            self._fillerTask.cancel()
            try:
                await asyncio.wait_for(self._fillerTask, timeout=0.5)
            except asyncio.TimeoutError as err:
                print(str(err))
        await super().close()


    async def _filler(self, rate_limit: float = 1):
        """Filler task to fill the leaky bucket algo"""
        try:
            if self._queue == None:
                return 
            self.rate_limit = rate_limit
            sleep = self._get_sleep()
            updated_at = time.monotonic()
            fraction = 0
            extra_increment = 0
            for i in range(0,self._queue.maxsize):
                self._queue.put_nowait(i)
            while True:
                if not self._queue.full():
                    now = time.monotonic()
                    increment = rate_limit * (now - updated_at)
                    fraction += increment % 1
                    extra_increment = fraction // 1
                    items_2_add = int(min(self._queue.maxsize - self._queue.qsize(), int(increment) + extra_increment))
                    fraction = fraction % 1
                    for i in range(0,items_2_add):
                        self._queue.put_nowait(i)
                    updated_at = now
                await asyncio.sleep(sleep)
        except asyncio.CancelledError:
            print('Cancelled')
        except Exception as err:
            print(str(err))


    async def _allow(self) -> None:
        if self._queue != None:
            # debug 
            #if self._start_time == None:
            #    self._start_time = time.time()
            await self._queue.get()
            self._queue.task_done()
        return None


    async def _request(self, *args,**kwargs):
        """Throttled _request()"""
        await self._allow()
        return await super()._request(*args,**kwargs)
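
Usage would then look roughly like this (a minimal sketch; the URL is a placeholder):

import asyncio

async def main():
    async with ThrottledClientSession(rate_limit=15) as session:
        async with session.get('https://example.com/api') as resp:
            print(resp.status)

asyncio.run(main())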

I liked how @sraw approached this with asyncio, but their answer didn't quite cut it for me. Since I don't know whether my download calls will each be faster or slower than the rate limit, I want the option to run many in parallel when requests are slow and to run them one at a time when requests are very fast, so that I'm always right at the rate limit.

I do this by using a queue with a producer that produces new tasks at the rate limit, and many consumers that will either all wait on the next job if they're fast, or let work back up in the queue if they are slow; the consumers then run as fast as the processor/network allow:

import asyncio
from datetime import datetime 

async def download(url):
  # download or whatever
  task_time = 1/10
  await asyncio.sleep(task_time)
  result = datetime.now()
  return result, url

async def producer_fn(queue, urls, max_per_second):
  for url in urls:
    await queue.put(url)
    await asyncio.sleep(1/max_per_second)
 
async def consumer(work_queue, result_queue):
  while True:
    url = await work_queue.get()
    result = await download(url)
    work_queue.task_done()
    await result_queue.put(result)

urls = range(20)
async def main():
  work_queue = asyncio.Queue()
  result_queue = asyncio.Queue()

  num_consumer_tasks = 10
  max_per_second = 5
  consumers = [asyncio.create_task(consumer(work_queue, result_queue))
               for _ in range(num_consumer_tasks)]    
  producer = asyncio.create_task(producer_fn(work_queue, urls, max_per_second))
  await producer

  # wait for the remaining tasks to be processed
  await work_queue.join()
  # cancel the consumers, which are now idle
  for c in consumers:
    c.cancel()

  while not result_queue.empty():
    result, url = await result_queue.get()
    print(f'{url} finished at {result}')
 
asyncio.run(main())

I think I have part of the solution. Here I am filling a queue up with urls then spawning 25 coroutines that each loop over the queue. I am making requests against scraperAPI.com and they have a 25 concurrent thread limit, so I use a semaphore. One of the problems, like you mention, is that when you run gather() it executes everything at once. What you need to do is use create_task with an await before each task so that they don't all get executed at once. Any task created with create_task is immediately run.

Notes: I realize I create a ClientSession per call; if you want to create only one per loop, you can put the async with ClientSession(connector=aiohttp.TCPConnector(limit=1), timeout=ClientTimeout(total=40), raise_for_status=True) as session: in the getData() function before the while loop, for example (see the sketch after the code below).

With this setup I can run 25 coroutines and not violate the 25 concurrent connection rule.

As far as the question goes, the key here is the await asyncio.sleep(1.1) before every create_task() call in async_payload_wrapper().

# Assumed imports and globals for this snippet (not shown in the original answer):
import asyncio
import random
import string
import time
from datetime import datetime

import aiohttp
from aiohttp import ClientSession, ClientTimeout

THREADS = 25     # number of consumer coroutines
urls = []        # fill with the URLs to scrape
errors = []      # collects URLs that failed

def processData(data, routine, url):
    # placeholder for the non-async parsing step
    pass

async def send_request(sem, url, routine):
    start_time = time.time()
    print(f"{routine}, sending request: {datetime.now()}")
    params = {
                'api_key': 'nunya',
                'url': '%s' % url, 
                'render_js': 'false',
                'premium_proxy': 'false', 
                'country_code':'us'
            }
    try:
        async with sem:
            async with ClientSession(connector=aiohttp.TCPConnector(limit=1), timeout=ClientTimeout(total=40), raise_for_status=True) as session:
                async with session.get(url='http://api.scraperapi.com',params=params,) as response:              
                    data = await response.content.read()                     
                    print(f"{routine}, done request: {time.time() - start_time} seconds")                    
            return data

    except Exception as e:
        
        errors.append(url)
        print(f"here is the error: {e}, {datetime.now()}, {routine} {url}, {time.time() - start_time} seconds") 

async def getData(sem, q, test):
    while True:
        if not q.empty():
            url = q.get_nowait()
            resp = await send_request(sem, url ,test)
            ##non async call which parses the data         
            processData(resp, test, url)
        else:
            print('done')
            break

async def async_payload_wrapper(sem):
    tasks = []
    q = asyncio.Queue()
    for url in urls:
        await q.put(url)  

    for i in range(THREADS):
        await asyncio.sleep(1.1)
        tasks.append(
            asyncio.create_task(getData(sem, q, ''.join(random.choice(string.ascii_lowercase) for i in range(10))))
        )
    await asyncio.gather(*tasks)



if __name__ == '__main__':
    sem = asyncio.Semaphore(25)
    asyncio.run(async_payload_wrapper(sem))
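
Following the note above about reusing one session per consumer, this is a minimal sketch of getData() with the ClientSession created once before the while loop instead of once per request (connector and timeout values are copied from the code above; send_request would then take the session as a parameter instead of creating its own):

async def getData(sem, q, test):
    # one session per consumer coroutine instead of one per request
    async with ClientSession(connector=aiohttp.TCPConnector(limit=1),
                             timeout=ClientTimeout(total=40),
                             raise_for_status=True) as session:
        while True:
            if not q.empty():
                url = q.get_nowait()
                # send_request adjusted to accept `session`
                resp = await send_request(sem, session, url, test)
                processData(resp, test, url)   # non-async parsing step
            else:
                print('done')
                break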

I developed a library named octopus-api (https://pypi.org/project/octopus-api/) that lets you rate limit and set the number of concurrent API calls to the endpoint, using aiohttp under the hood. Its goal is to simplify all the aiohttp setup needed.

Here is an example of how to use it, where get_ethereum is the user-defined request function:

from octopus_api import TentacleSession, OctopusApi
from typing import Dict, List

if __name__ == '__main__':
    async def get_ethereum(session: TentacleSession, request: Dict):
        async with session.get(url=request["url"], params=request["params"]) as response:
            body = await response.json()
            return body

    client = OctopusApi(rate=50, resolution="sec", concurrency=6)
    result: List = client.execute(requests_list=[{
        "url": "https://api.pro.coinbase.com/products/ETH-EUR/candles?granularity=900&start=2021-12-04T00:00:00Z&end=2021-12-04T00:00:00Z",
        "params": {}}] * 1000, func=get_ethereum)
    print(result)

TentacleSession works the same way you write POST, GET, PUT and PATCH requests with aiohttp.

Let me know if it helps your issue related to rate limits and parallel calls.
