
Maximize number of parallel requests (aiohttp)

tl;dr: how do I maximize the number of HTTP requests I can send in parallel?

I am fetching data from multiple URLs with the aiohttp library. I'm testing its performance and I've observed that somewhere in the process there is a bottleneck, where running more URLs at once just doesn't help.

I am using this code:

import asyncio
import aiohttp

async def fetch(url, session):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'}
    try:
        async with session.get(
            url, headers=headers,
            ssl=False,
            timeout=aiohttp.ClientTimeout(
                total=None,
                sock_connect=10,
                sock_read=10
            )
        ) as response:
            content = await response.read()
            return (url, 'OK', content)
    except Exception as e:
        print(e)
        return (url, 'ERROR', str(e))

async def run(url_list):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in url_list:
            task = asyncio.ensure_future(fetch(url, session))
            tasks.append(task)
        # wait for all requests to finish and collect their results
        responses = await asyncio.gather(*tasks)
    return responses

loop = asyncio.get_event_loop()
task = asyncio.ensure_future(run(url_list))
loop.run_until_complete(task)
result = task.result()

Running this with url_list of varying length (testing against https://httpbin.org/delay/2), I see that adding more URLs to be run at once helps only up to ~100 URLs, after which the total time starts to grow proportionally to the number of URLs (in other words, the time per URL does not decrease). This suggests that something fails when trying to process all of them at once. In addition, with more URLs in one batch I occasionally receive connection timeout errors.

[plot: total time vs. number of URLs sent at once]

  • Why is this happening? What exactly limits the speed here?
  • How can I check the maximum number of parallel requests I can send on a given computer? (I mean an exact number - not approximately by 'trial and error' as above.)
  • What can I do to increase the number of requests processed at once?

I am running this on Windows.

EDIT in response to a comment:

This is the same data with limit set to None. There is only a slight improvement at the end, and there are many connection timeout errors with 400 URLs sent at once. I ended up using limit = 200 on my actual data.
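(For reference, a sketch of what limit = 200 refers to here, assuming it means the TCPConnector limit discussed in the answer below:)

connector = aiohttp.TCPConnector(limit=200)
async with aiohttp.ClientSession(connector=connector) as session:
    # ...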

[plot: same data as above, with limit=None]

By default aiohttp limits the number of simultaneous connections to 100. It achieves this by setting a default limit on the TCPConnector object that is used by ClientSession. You can bypass it by creating a custom connector and passing it to the session:

connector = aiohttp.TCPConnector(limit=None)
async with aiohttp.ClientSession(connector=connector) as session:
    # ...
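Note that by default the session takes ownership of a connector passed this way (connector_owner=True) and closes it together with the session.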

Note however that you probably don't want to set this number too high: your network capacity, CPU, RAM and the target server have their own limits, and trying to make an enormous number of connections can lead to increasing failures.

The optimal number can probably be found only through experiments on a concrete machine.
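For example, a minimal benchmark sketch (reusing the run() coroutine from the question and the https://httpbin.org/delay/2 test endpoint; the batch sizes are illustrative) that shows where the time per URL stops improving:

import asyncio
import time

def benchmark(batch_sizes=(50, 100, 200, 400)):
    # Time run() over increasingly large batches; once total time starts
    # growing proportionally to the batch size, the practical limit is reached.
    loop = asyncio.get_event_loop()
    for n in batch_sizes:
        urls = ['https://httpbin.org/delay/2'] * n
        start = time.monotonic()
        loop.run_until_complete(run(urls))
        elapsed = time.monotonic() - start
        print(f'{n} urls: {elapsed:.1f}s total, {elapsed / n:.3f}s per url')

benchmark()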


Unrelated:

You don't have to create tasks without reason. Most asyncio APIs accept regular coroutines. For example, your last lines of code can be altered this way:

loop = asyncio.get_event_loop()
loop.run_until_complete(run(url_list))

Or even just asyncio.run(run(url_list)) (doc) if you're using Python 3.7.
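The same point applies inside run() itself: asyncio.gather accepts plain coroutines and wraps them into tasks for you, so a sketch of the simplified version could look like this:

async def run(url_list):
    async with aiohttp.ClientSession() as session:
        # gather accepts coroutines directly; no ensure_future needed
        return await asyncio.gather(*(fetch(url, session) for url in url_list))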
