asyncio.gather not waiting long enough for all tasks to complete
I'm writing code to get some links from a list of input urls using asyncio, aiohttp and BeautifulSoup.
Here's a snippet of the relevant code:
import asyncio
import aiohttp
import bs4

def async_get_jpg_links(links):
    def extractLinks(ep_num, html):
        # Parse only the <article> tag and collect the urls
        # from each image's data-src attribute.
        soup = bs4.BeautifulSoup(html, 'lxml',
                                 parse_only=bs4.SoupStrainer('article'))
        main = soup.findChildren('img')
        return ep_num, [img_link.get('data-src') for img_link in main]

    async def get_htmllinks(session, ep_num, ep_link):
        async with session.get(ep_link) as response:
            html_txt = await response.text()
            return extractLinks(ep_num, html_txt)

    async def get_jpg_links(ep_links):
        async with aiohttp.ClientSession() as session:
            tasks = [get_htmllinks(session, num, link)
                     for num, link in enumerate(ep_links, 1)]
            return await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    return loop.run_until_complete(get_jpg_links(links))
I then later call jpgs_links = dict(async_get_jpg_links(hrefs)), where hrefs is a bunch of links (~170 links). jpgs_links should be a dictionary with numerical keys and a bunch of lists as values. Some of the values come back as empty lists (which should instead be filled with data). When I cut down the number of links in hrefs, more of the lists come back full.
For the photo below, I reran the same code with a minute between, and as you can see, I get different lists that come back empty and different ones that come back full.
Could it be that asyncio.gather is not waiting for all the tasks to finish?
How can I get asyncio to return no empty lists, while keeping the number of links in hrefs high?
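Before blaming gather, it is worth confirming that every response actually came back with status 200: a non-200 body (an error page, say) will still parse cleanly but contain no article/img tags, producing exactly the empty lists described above. A minimal diagnostic sketch, as a drop-in variant of the get_htmllinks above (the logging line is my own addition, not part of the original code):

async def get_htmllinks(session, ep_num, ep_link):
    async with session.get(ep_link) as response:
        # Make silent failures visible before parsing.
        if response.status != 200:
            print(f'link {ep_num}: HTTP {response.status} for {ep_link}')
        html_txt = await response.text()
        return extractLinks(ep_num, html_txt)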
So, turns out that some of the urls I sent in threw up the error:
raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 504, message='Gateway Time-out',...
So I changed
async def get_htmllinks(session, ep_num, ep_link):
    async with session.get(ep_link) as response:
        html_txt = await response.text()
        return extractLinks(ep_num, html_txt)
to
async def get_htmllinks(session, ep_num, ep_link):
    html_txt = None
    while not html_txt:
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
        except aiohttp.ClientResponseError:
            await asyncio.sleep(1)
    return extractLinks(ep_num, html_txt)
What this does is retry the connection after sleeping for a second (the await asyncio.sleep(1) does that).
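One caveat: the while loop above retries forever if a url never recovers. A bounded variant with exponential backoff, sketched here with a hypothetical max_retries parameter, gives up after a few attempts instead:

async def get_htmllinks(session, ep_num, ep_link, max_retries=5):
    for attempt in range(max_retries):
        try:
            async with session.get(ep_link) as response:
                # Raise ClientResponseError on 4xx/5xx (e.g. the 504s above).
                response.raise_for_status()
                html_txt = await response.text()
            return extractLinks(ep_num, html_txt)
        except aiohttp.ClientResponseError:
            # Back off 1s, 2s, 4s, ... between attempts.
            await asyncio.sleep(2 ** attempt)
    # Give up: return an empty result rather than hanging forever.
    return ep_num, []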
Nothing to do with asyncio or BeautifulSoup, apparently.