asyncio.gather not waiting long enough for all tasks to complete
I'm writing code to get some links from a list of input urls using asyncio, aiohttp and BeautifulSoup.
Here's a snippet of the relevant code:
import asyncio
import aiohttp
import bs4

def async_get_jpg_links(links):
    def extractLinks(ep_num, html):
        # Parse only the <article> tag and collect the urls
        # from each image's data-src attribute.
        soup = bs4.BeautifulSoup(html, 'lxml',
                                 parse_only=bs4.SoupStrainer('article'))
        main = soup.findChildren('img')
        return ep_num, [img_link.get('data-src') for img_link in main]

    async def get_htmllinks(session, ep_num, ep_link):
        async with session.get(ep_link) as response:
            html_txt = await response.text()
            return extractLinks(ep_num, html_txt)

    async def get_jpg_links(ep_links):
        async with aiohttp.ClientSession() as session:
            tasks = [get_htmllinks(session, num, link)
                     for num, link in enumerate(ep_links, 1)]
            return await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    return loop.run_until_complete(get_jpg_links(links))
I then later call jpgs_links = dict(async_get_jpg_links(hrefs)), where hrefs is a bunch of links (~170 links). jpgs_links should be a dictionary with numerical keys and a bunch of lists as values. Some of the values come back as empty lists (which should instead be filled with data). When I cut down the number of links in hrefs, more of the lists come back full.
For the photo below, I reran the same code with a minute between, and as you can see, I get different lists that come back empty and different ones that come back full.
Could it be that asyncio.gather is not waiting for all the tasks to finish?
How can I get asyncio to return no empty lists, while keeping the number of links in hrefs high?
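Before blaming gather, it is worth confirming that every response actually came back with status 200: a non-200 body (an error page, say) will still parse cleanly but contain no article/img tags, producing exactly the empty lists described above. A minimal diagnostic sketch, as a drop-in variant of the get_htmllinks above (the logging line is my own addition, not part of the original code):

async def get_htmllinks(session, ep_num, ep_link):
    async with session.get(ep_link) as response:
        # Make silent failures visible before parsing.
        if response.status != 200:
            print(f'link {ep_num}: HTTP {response.status} for {ep_link}')
        html_txt = await response.text()
        return extractLinks(ep_num, html_txt)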
So, turns out that some of the urls I sent in threw up the error:
raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 504, message='Gateway Time-out',...
So I changed
async def get_htmllinks(session, ep_num, ep_link):
    async with session.get(ep_link) as response:
        html_txt = await response.text()
        return extractLinks(ep_num, html_txt)
to
async def get_htmllinks(session, ep_num, ep_link):
    html_txt = None
    while not html_txt:
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
        except aiohttp.ClientResponseError:
            await asyncio.sleep(1)
    return extractLinks(ep_num, html_txt)
What this does is retry the connection after sleeping for a second (the await asyncio.sleep(1) does that).
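One caveat: the while loop above retries forever if a url never recovers. A bounded variant with exponential backoff, sketched here with a hypothetical max_retries parameter, gives up after a few attempts instead:

async def get_htmllinks(session, ep_num, ep_link, max_retries=5):
    for attempt in range(max_retries):
        try:
            async with session.get(ep_link) as response:
                # Raise ClientResponseError on 4xx/5xx (e.g. the 504s above).
                response.raise_for_status()
                html_txt = await response.text()
            return extractLinks(ep_num, html_txt)
        except aiohttp.ClientResponseError:
            # Back off 1s, 2s, 4s, ... between attempts.
            await asyncio.sleep(2 ** attempt)
    # Give up: return an empty result rather than hanging forever.
    return ep_num, []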
Nothing to do with asyncio or BeautifulSoup, apparently.