Fetch multiple URLs with asyncio/aiohttp and retry for failures

In a previous question (asyncio web scraping 101: fetching multiple urls with aiohttp), an answerer suggested this way of fetching multiple URLs with aiohttp, using Python 3.5's new async with syntax:
import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.wait([loop.create_task(fetch(session, url))
                                  for url in urls])
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))
        # do something with the_results
But when one of the session.get(url) requests breaks (as above, because of http://SDFKHSKHGKLHSKLJHGSDFKSJH.com), the error is not handled and the whole thing breaks.

I looked for ways to insert a check on the result of session.get(url), such as a place for a try ... except ..., or an if response.status != 200:, but I do not understand how to work with async with, await and the various objects involved.
Since async with is still very new, there are not many examples around. It would be very helpful to many people if an asyncio wizard could show how to do this; after all, fetching multiple resources concurrently is the first thing most people want to try with asyncio.

Goal

The goal is to be able to inspect the_results and quickly see which URLs failed (and why) and which succeeded (with their responses).
I would use gather instead of wait: it can return exceptions as objects instead of raising them. You can then check each result to see whether it is an instance of some exception class.
import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.gather(
        *[fetch(session, url) for url in urls],
        return_exceptions=True  # default is False, which would raise
    )
    # for testing purposes only
    # gather returns results in the order of the coros
    for idx, url in enumerate(urls):
        print('{}: {}'.format(url, 'ERR' if isinstance(results[idx], Exception) else 'OK'))
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = [
        'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
        'http://google.com',
        'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))
Testing:
$python test.py
http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
http://google.com: OK
http://twitter.com: OK
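The key mechanism here is gather's return_exceptions=True, and it can be seen in isolation without any network access. A minimal sketch (the ok/boom coroutines are hypothetical stand-ins, and it uses the modern asyncio.run entry point for brevity):

```python
import asyncio

async def ok(value):
    return value

async def boom():
    raise ValueError('boom')

async def main():
    # gather(return_exceptions=True) hands back the exception objects
    # themselves, in the same order as the coroutines, instead of
    # raising the first one
    return await asyncio.gather(ok('hello'), boom(), ok('world'),
                                return_exceptions=True)

results = asyncio.run(main())
print(results[0])                          # hello
print(isinstance(results[1], ValueError))  # True
print(results[2])                          # world
```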
I am far from an asyncio expert, but you do want to catch the error, and that requires catching socket errors:
import socket

async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                print(response.status == 200)
                return await response.text()
        except socket.error as e:
            print(e.strerror)
Running the code and printing the_results gives:
Cannot connect to host sdfkhskhgklhskljhgsdfksjh.com:80 ssl:False [Can not connect to sdfkhskhgklhskljhgsdfksjh.com:80 [Name or service not known]]
True
True
({<Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!DOCTYPE ht...y>\n</html>\n'>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result=None>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!doctype ht.../body></html>'>}, set())
You can see that we get the error, and the subsequent calls still return the html successfully.
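The odd-looking pair of sets printed above is what asyncio.wait returns: a (done, pending) tuple of Task objects, whose values have to be pulled out with task.result(). A minimal sketch with a hypothetical work coroutine (using the modern asyncio.run and asyncio.create_task for brevity):

```python
import asyncio

async def work(n):
    return n * 2

async def main():
    # wait() returns two sets of Tasks: (done, pending); the actual values
    # must be extracted with task.result(), which re-raises any stored
    # exception -- exactly what broke the original script
    done, pending = await asyncio.wait(
        [asyncio.create_task(work(n)) for n in (1, 2, 3)])
    return sorted(t.result() for t in done)

values = asyncio.run(main())
print(values)  # [2, 4, 6]
```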
We should really catch OSError, because socket.error is a deprecated alias of OSError since Python 3.3:
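This can be checked directly: since Python 3.3 (PEP 3151), socket.error is literally the same class object as OSError, and connection-related exceptions such as ConnectionError are subclasses of it, so they are caught too:

```python
import socket

# socket.error has been an alias of OSError since Python 3.3 (PEP 3151)
print(socket.error is OSError)               # True
print(issubclass(ConnectionError, OSError))  # True
```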
async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                return await response.text()
        except OSError as e:
            print(e)
If you also want to check that the response is 200, put your if inside the try; the reason attribute can give you extra information:
async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                if response.status != 200:
                    print(response.reason)
                return await response.text()
        except OSError as e:
            print(e.strerror)
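The question title also asks about retrying failures, which neither answer covers. One possible sketch, building on gather's return_exceptions=True: keep re-gathering the URLs whose result was an exception, up to a maximum number of attempts. The flaky fetch coroutine below is a hypothetical stand-in for the aiohttp call ('bad' fails on its first attempt only):

```python
import asyncio

# hypothetical flaky fetch standing in for the aiohttp call:
# 'bad' fails on its first attempt, everything succeeds afterwards
attempts = {}

async def fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url == 'bad' and attempts[url] == 1:
        raise ConnectionError('first try failed')
    return 'body of ' + url

async def fetch_all_with_retry(urls, max_attempts=2):
    results, pending = {}, list(urls)
    for _ in range(max_attempts):
        if not pending:
            break
        got = await asyncio.gather(*[fetch(u) for u in pending],
                                   return_exceptions=True)
        # keep the successes, queue the failures for the next round
        failed = []
        for url, res in zip(pending, got):
            if isinstance(res, Exception):
                failed.append(url)
            else:
                results[url] = res
        pending = failed
    return results, pending  # pending: urls that failed every attempt

good, still_failed = asyncio.run(fetch_all_with_retry(['good', 'bad']))
print(good)          # {'good': 'body of good', 'bad': 'body of bad'}
print(still_failed)  # []
```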