The tasks from asyncio.gather do not work concurrently

I want to scrape data from a website concurrently, but I found that the following program is NOT executed concurrently.

import asyncio

import requests
from bs4 import BeautifulSoup

async def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "html.parser")

    future = asyncio.Future()
    future.set_result(soup)
    return future

async def parseURL_async(url):    
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))

    return soup

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))

However, this program starts to download the second content only after the first one finishes. If my understanding is correct, the await keyword in await return_soup(url) waits for the function to complete, and while waiting it returns control to the event loop, which enables the loop to start the second download.

And once the function finally finishes execution, the future instance within it gets the result value.

But why does this not work concurrently? What am I missing here?

Using asyncio is different from using threads in that you cannot add it to an existing code base to make it concurrent. Specifically, code that runs in the asyncio event loop must not block - all blocking calls must be replaced with non-blocking versions that yield control to the event loop. In your case, requests.get blocks and defeats the parallelism implemented by asyncio.

To avoid this problem, you need to use an http library that is written with asyncio in mind, such as aiohttp.
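For reference, here is a minimal sketch of the question's code ported to aiohttp (this assumes aiohttp is installed, with url_1 and url_2 defined as in the question):

import asyncio

import aiohttp
from bs4 import BeautifulSoup

async def return_soup(session, url):
    # session.get() yields to the event loop while waiting on the network,
    # so both downloads can be in flight at the same time.
    async with session.get(url) as r:
        text = await r.text(encoding="utf-8")
    return BeautifulSoup(text, "html.parser")

async def parseURL_async(session, url):
    print("Started to download {0}".format(url))
    soup = await return_soup(session, url)
    print("Finished downloading {0}".format(url))
    return soup

async def main():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            parseURL_async(session, url_1),
            parseURL_async(session, url_2),
        )

soup_1, soup_2 = asyncio.run(main())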

I'll add a little more to user4815162342's response. The asyncio framework uses coroutines that must cede control of the thread while they do the long operation. See the diagram at the end of this section for a nice graphical representation. As user4815162342 mentioned, the requests library doesn't support asyncio. I know of two ways to make this work concurrently. The first is to do what user4815162342 suggested and switch to a library with native support for asynchronous requests. The second is to run this synchronous code in separate threads or processes. The latter is easy because of the run_in_executor function.

import asyncio

import requests
from bs4 import BeautifulSoup

loop = asyncio.get_event_loop()

async def return_soup(url):
    # Hand the blocking requests.get off to the default thread pool so the
    # event loop stays free to start the second download in the meantime.
    r = await loop.run_in_executor(None, requests.get, url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

async def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = await return_soup(url)
    print("Finished downloading {0}".format(url))

    return soup

t = [parseURL_async(url_1), parseURL_async(url_2)]
loop.run_until_complete(asyncio.gather(*t))

This solution removes some of the benefit of using asyncio, as the long operation will still probably be executed from a fixed-size thread pool, but it's also much easier to start with.
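If the default pool becomes a bottleneck, run_in_executor also accepts an explicit executor in place of None. A short sketch (the fetch helper and the max_workers value are illustrative, not part of the original answer):

import asyncio
from concurrent.futures import ThreadPoolExecutor

import requests

# max_workers=8 is an arbitrary choice for illustration.
executor = ThreadPoolExecutor(max_workers=8)

async def fetch(url):
    loop = asyncio.get_running_loop()
    # requests.get still blocks, but only its worker thread, not the event loop.
    return await loop.run_in_executor(executor, requests.get, url)

async def main(urls):
    return await asyncio.gather(*(fetch(u) for u in urls))

# responses = asyncio.run(main([url_1, url_2]))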

The reason, as mentioned in other answers, is the lack of library support for coroutines.

As of Python 3.9 though, you can use the function to_thread as an alternative for I/O concurrency.

Obviously this is not exactly equivalent because, as the name suggests, it runs your functions in separate threads as opposed to a single thread in the event loop, but it can be a way to achieve I/O concurrency without relying on proper async support from the library.

In your example the code would be:

import asyncio

import requests
from bs4 import BeautifulSoup

def return_soup(url):
    r = requests.get(url)
    r.encoding = "utf-8"
    return BeautifulSoup(r.text, "html.parser")

def parseURL_async(url):
    print("Started to download {0}".format(url))
    soup = return_soup(url)
    print("Finished downloading {0}".format(url))
    return soup

async def main():
    # to_thread takes the function and its arguments directly,
    # so no lambda wrapper is needed.
    result_url_1, result_url_2 = await asyncio.gather(
        asyncio.to_thread(parseURL_async, url_1),
        asyncio.to_thread(parseURL_async, url_2),
    )
    return result_url_1, result_url_2

asyncio.run(main())
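Note that asyncio.to_thread is essentially a convenience wrapper that submits the call to the event loop's default thread pool, much like loop.run_in_executor(None, ...), so the fixed-size-pool caveat from the previous answer applies here as well.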
