asyncio web scraping 101: fetching multiple URLs with aiohttp

Question

In an earlier question, one of the aiohttp authors suggested a way to fetch multiple URLs with aiohttp using the new async with syntax from Python 3.5:

import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.wait([loop.create_task(fetch(session, url))
                                  for url in urls])
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))
        # do something with the the_results

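However, when one of the session.get(url) requests breaks (as above, because of http://SDFKHSKHGKLHSKLJHGSDFKSJH.com), the error is not handled and the whole thing breaks.

I looked for ways to insert tests on the result of session.get(url), for instance a place for a try ... except ..., or for an if response.status != 200:, but I just don't understand how to work with async with, await, and the various objects.

Since async with is still very new, there are not many examples. It would be very helpful to many people if an asyncio wizard could show how to do this. After all, one of the first things most people want to test with asyncio is fetching multiple resources concurrently.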

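Goal

The goal is that we can inspect the_results and quickly see either:

this URL failed (and why: status code, maybe the exception name), or
this URL worked, and here is a usable response object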

Answer 1

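I would use gather instead of wait; it can return exceptions as objects instead of raising them. You can then check each result to see whether it is an instance of some exception.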

import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.gather(
        *[fetch(session, url) for url in urls],
        return_exceptions=True  # default is false, that would raise
    )

    # for testing purposes only
    # gather returns results in the order of coros
    for idx, url in enumerate(urls):
        print('{}: {}'.format(url, 'ERR' if isinstance(results[idx], Exception) else 'OK'))
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = [
        'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
        'http://google.com',
        'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))

Test:

$ python test.py
http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
http://google.com: OK
http://twitter.com: OK
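
For trying the pattern without touching the network, a minimal sketch along the same lines might look like the following. It makes some assumptions: fake_fetch and describe are hypothetical helpers (not part of aiohttp), and the failing host name is made up. It also records the exception name for each failure, which is closer to what the question asks for.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for fetch(session, url): no network involved.
    await asyncio.sleep(0)
    if "bad-host" in url:
        raise OSError("Name or service not known")
    return "<html>{}</html>".format(url)


def describe(result):
    # With return_exceptions=True, failures arrive as exception objects.
    if isinstance(result, Exception):
        return "ERR ({})".format(type(result).__name__)
    return "OK"


async def fetch_all(urls):
    results = await asyncio.gather(
        *[fake_fetch(url) for url in urls],
        return_exceptions=True,
    )
    # gather preserves input order, so zip pairs each URL with its result.
    return {url: describe(result) for url, result in zip(urls, results)}


if __name__ == "__main__":
    urls = ["http://bad-host.invalid", "http://example.com"]
    report = asyncio.run(fetch_all(urls))
    for url, outcome in report.items():
        print("{}: {}".format(url, outcome))
```

Because gather keeps the order of its inputs, no extra bookkeeping is needed to match results back to URLs.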

Answer 2

I am far from an asyncio expert, but you want to catch the error, and here the error to catch is a socket error:

import socket

async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                print(response.status == 200)
                return await response.text()
        except socket.error as e:
            print(e.strerror)

Running the code and printing the_results:

Cannot connect to host sdfkhskhgklhskljhgsdfksjh.com:80 ssl:False [Can not connect to sdfkhskhgklhskljhgsdfksjh.com:80 [Name or service not known]]
True
True
({<Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!DOCTYPE ht...y>\n</html>\n'>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result=None>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!doctype ht.../body></html>'>}, set())

You can see that we get the error, and the remaining calls still return their HTML successfully.
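
The last line of that output is the raw return value of asyncio.wait: a (done, pending) pair of sets of tasks, not the response texts themselves. A minimal sketch of unpacking it follows. Here fake_fetch is a hypothetical, network-free stand-in that swallows its error and returns None, as the fetch above does. Since sets are unordered, each task is mapped back to its URL explicitly.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for fetch(): catches the error and falls through,
    # so the failing URL yields None, as in the output above.
    try:
        await asyncio.sleep(0)
        if "bad-host" in url:
            raise OSError("Name or service not known")
        return "<html>{}</html>".format(url)
    except OSError as e:
        print(e)


async def fetch_all(urls):
    # Remember which task belongs to which URL; the done set has no order.
    tasks = {asyncio.create_task(fake_fetch(url)): url for url in urls}
    done, pending = await asyncio.wait(list(tasks))
    # Without a timeout, wait returns only once every task has finished.
    return {tasks[task]: task.result() for task in done}
```

A None value then marks a URL whose error was caught inside the coroutine, although the reason for the failure is lost unless fetch returns it.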

We should really catch an OSError, since socket.error has been a deprecated alias of OSError since Python 3.3:

async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                return await response.text()
        except OSError as e:
            print(e)
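
That socket.error is just another name for OSError can be checked directly. This is a quick sanity check rather than part of the fix, and it assumes Python 3.3 or later; the error message is made up.

```python
import socket

# On Python 3.3+, socket.error is the very same class as OSError,
# so `except OSError` also catches what `except socket.error` would.
is_alias = socket.error is OSError
print(is_alias)

try:
    raise socket.error("simulated connection failure")
except OSError as e:
    caught = e  # keep a reference; `e` is cleared when the block ends
    print(type(e).__name__, e)
```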

If you also want to check that the response is 200, put your if inside the try as well; you can use the reason attribute to get more information:

async def fetch(session, url):
    with aiohttp.Timeout(10):
        try:
            async with session.get(url) as response:
                if response.status != 200:
                    print(response.reason)
                return await response.text()
        except OSError as e:
            print(e.strerror)
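
To get closer to the stated goal (seeing, per URL, why it failed or getting the usable response), one option is to return a small record from fetch instead of printing. The sketch below is network-free and rests on assumptions: FetchResult, fake_get, and the canned statuses are hypothetical, standing in for session.get(url), response.status, and response.reason.

```python
import asyncio
from collections import namedtuple

# Hypothetical record: on success `text` is set, on failure `error` is set.
FetchResult = namedtuple("FetchResult", "url ok text error")

# Canned (status, reason, body) triples instead of real HTTP responses.
CANNED = {
    "http://example.com/ok": (200, "OK", "<html>hello</html>"),
    "http://example.com/missing": (404, "Not Found", ""),
}


async def fake_get(url):
    # Stand-in for session.get(url): unknown hosts fail like a DNS error.
    await asyncio.sleep(0)
    if url not in CANNED:
        raise OSError("Name or service not known")
    return CANNED[url]


async def fetch(url):
    try:
        status, reason, body = await fake_get(url)
    except OSError as e:
        error = "{}: {}".format(type(e).__name__, e)
        return FetchResult(url, False, None, error)
    if status != 200:
        return FetchResult(url, False, None, "{} {}".format(status, reason))
    return FetchResult(url, True, body, None)


async def fetch_all(urls):
    # fetch() handles OSError itself, so plain gather is enough here.
    return await asyncio.gather(*[fetch(url) for url in urls])
```

Each entry of the returned list then says whether its URL worked and, if not, carries either the status line or the exception name and message.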

Answer 1
14, accepted, 2016-03-10 22:27:27

Answer 2
5, 2016-03-10 21:26:27
