asyncio.gather not waiting long enough for all tasks to complete

I'm writing code to get some links from a list of input URLs using asyncio, aiohttp, and BeautifulSoup.

Here's a snippet of the relevant code:

import asyncio

import aiohttp
import bs4


def async_get_jpg_links(links):
    def extractLinks(ep_num, html):
        # Parse only the <article> tag to keep the soup small.
        soup = bs4.BeautifulSoup(html, 'lxml',
            parse_only=bs4.SoupStrainer('article'))
        main = soup.findChildren('img')
        return ep_num, [img_link.get('data-src') for img_link in main]

    async def get_htmllinks(session, ep_num, ep_link):
        # Fetch one page and hand the HTML off to the parser.
        async with session.get(ep_link) as response:
            html_txt = await response.text()
        return extractLinks(ep_num, html_txt)

    async def get_jpg_links(ep_links):
        # One task per link, numbered from 1, all run concurrently.
        async with aiohttp.ClientSession() as session:
            tasks = [get_htmllinks(session, num, link)
                     for num, link in enumerate(ep_links, 1)]
            return await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    return loop.run_until_complete(get_jpg_links(links))

Later I call jpgs_links = dict(async_get_jpg_links(hrefs)), where hrefs is a list of about 170 links.

jpgs_links should be a dictionary with numerical keys and lists as values. Some of the values come back as empty lists (which should instead be filled with data). When I cut down the number of links in hrefs, more of the lists come back full.
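For concreteness, a minimal sketch of the call and the shape I expect back (the URLs and values here are made-up placeholders, not my real inputs):

# Hypothetical usage; the URLs and values are placeholders.
hrefs = ['https://example.com/episode-1', 'https://example.com/episode-2']
jpgs_links = dict(async_get_jpg_links(hrefs))
# Expected shape: {1: ['a1.jpg', 'a2.jpg'], 2: ['b1.jpg']}
# The problem: some values come back as [] instead of a list of links.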

For the screenshot below, I reran the same code a minute apart, and as you can see, different lists come back empty and different ones come back full each time.

Could it be that asyncio.gather is not waiting for all the tasks to finish?

How can I get asyncio to return no empty lists while keeping the number of links in hrefs high?

[Screenshot: results of the code]

So, it turns out that some of the URLs I sent in threw this error:

raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 504, message='Gateway Time-out',...
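
For anyone debugging a similar case, here is a diagnostic sketch (my addition, not part of the fix below) that lets asyncio.gather collect exceptions instead of propagating the first one, so you can see exactly which links fail. It assumes get_htmllinks is defined as in the question and raises on bad responses:

# Diagnostic variant of get_jpg_links; the name and printout are mine.
async def get_jpg_links_debug(ep_links):
    async with aiohttp.ClientSession() as session:
        tasks = [get_htmllinks(session, num, link)
                 for num, link in enumerate(ep_links, 1)]
        # return_exceptions=True puts exception objects in place of
        # results rather than raising the first failure.
        results = await asyncio.gather(*tasks, return_exceptions=True)
    for num, res in enumerate(results, 1):
        if isinstance(res, Exception):
            print(f'link {num} failed: {res!r}')
    return [res for res in results if not isinstance(res, Exception)]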

So I changed

async def get_htmllinks(session, ep_num, ep_link):
    async with session.get(ep_link) as response:
        html_txt = await response.text()
    return extractLinks(ep_num, html_txt)

to

async def get_htmllinks(session, ep_num, ep_link):
    html_txt = None
    # Keep retrying until we actually get a page body back.
    while not html_txt:
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
        except aiohttp.ClientResponseError:
            await asyncio.sleep(1)
    return extractLinks(ep_num, html_txt)

What this does is retry the connection after sleeping for a second (that's what the await asyncio.sleep(1) is for).
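
One caveat worth noting (my addition, not part of the original fix): the while not html_txt loop retries forever if a link keeps failing, so one permanently dead URL would hang the whole run. A bounded-retry variant might look like this:

async def get_htmllinks(session, ep_num, ep_link, retries=5):
    # Retry up to `retries` times, pausing a second between attempts,
    # then re-raise so a permanently dead link can't stall the run.
    for attempt in range(retries):
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
            return extractLinks(ep_num, html_txt)
        except aiohttp.ClientResponseError:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(1)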

Nothing to do with asyncio or BeautifulSoup, apparently.
