Adding tasks to Python asyncio

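I'm trying to write a simple web crawler to test how the new asyncio module works, but I'm getting something wrong. I want to start the crawler with a single URL. The script should download that page, find any <a> tags on it, and schedule those links for download as well. The output I expect is a bunch of lines indicating that the first page has been downloaded, followed by the subsequent pages in random order (that is, in the order they finish downloading) until everything is done. Instead, the pages appear to be downloaded strictly one after another. I'm brand new to async programming in general and to the asyncio module in particular, so I'm sure there is some fundamental concept I'm missing.

Here is my code so far: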

import asyncio
import re
import requests
import time
from bs4 import BeautifulSoup
from functools import partial

@asyncio.coroutine
def get_page(url, depth=0):
    print('%s: Getting %s' % (time.time(), url))
    page = requests.get(url)
    print('%s: Got %s' % (time.time(), url))
    soup = BeautifulSoup(page.text)
    if depth < 2:
        for a in soup.find_all('a', href=re.compile(r'\w+\.html'))[:3]:
            u = 'https://docs.python.org/3/' + a['href']
            print('%s: Scheduling %s' % (time.time(), u))
            yield from get_page(u, depth+1)
    if depth == 0:
        loop.stop()
    return soup

root = 'https://docs.python.org/3/'
loop = asyncio.get_event_loop()
loop.create_task(get_page(root))
loop.run_forever()

Here is the output:

1434971882.3458219: Getting https://docs.python.org/3/
1434971893.0054126: Got https://docs.python.org/3/
1434971893.015218: Scheduling https://docs.python.org/3/genindex.html
1434971893.0153584: Getting https://docs.python.org/3/genindex.html
1434971894.464993: Got https://docs.python.org/3/genindex.html
1434971894.4752269: Scheduling https://docs.python.org/3/py-modindex.html
1434971894.4753256: Getting https://docs.python.org/3/py-modindex.html
1434971896.9845033: Got https://docs.python.org/3/py-modindex.html
1434971897.0756354: Scheduling https://docs.python.org/3/index.html
1434971897.0757186: Getting https://docs.python.org/3/index.html
1434971907.451529: Got https://docs.python.org/3/index.html
1434971907.4600112: Scheduling https://docs.python.org/3/genindex-Symbols.html
1434971907.4600625: Getting https://docs.python.org/3/genindex-Symbols.html
1434971917.6517148: Got https://docs.python.org/3/genindex-Symbols.html
1434971917.6789174: Scheduling https://docs.python.org/3/py-modindex.html
1434971917.6789672: Getting https://docs.python.org/3/py-modindex.html
1434971919.454042: Got https://docs.python.org/3/py-modindex.html
1434971919.574361: Scheduling https://docs.python.org/3/genindex.html
1434971919.574434: Getting https://docs.python.org/3/genindex.html
1434971920.5942516: Got https://docs.python.org/3/genindex.html
1434971920.6020699: Scheduling https://docs.python.org/3/index.html
1434971920.6021295: Getting https://docs.python.org/3/index.html
1434971922.1504402: Got https://docs.python.org/3/index.html
1434971922.1589775: Scheduling https://docs.python.org/3/library/__future__.html#module-__future__
1434971922.1590302: Getting https://docs.python.org/3/library/__future__.html#module-__future__
1434971923.30988: Got https://docs.python.org/3/library/__future__.html#module-__future__
1434971923.3215268: Scheduling https://docs.python.org/3/whatsnew/3.4.html
1434971923.321574: Getting https://docs.python.org/3/whatsnew/3.4.html
1434971926.6502898: Got https://docs.python.org/3/whatsnew/3.4.html
1434971926.89331: Scheduling https://docs.python.org/3/../genindex.html
1434971926.8934016: Getting https://docs.python.org/3/../genindex.html
1434971929.0996494: Got https://docs.python.org/3/../genindex.html
1434971929.1068246: Scheduling https://docs.python.org/3/../py-modindex.html
1434971929.1068716: Getting https://docs.python.org/3/../py-modindex.html
1434971932.5949798: Got https://docs.python.org/3/../py-modindex.html
1434971932.717457: Scheduling https://docs.python.org/3/3.3.html
1434971932.7175465: Getting https://docs.python.org/3/3.3.html
1434971934.009238: Got https://docs.python.org/3/3.3.html

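Using asyncio does not magically make all of your code asynchronous. In this case, requests is blocking: requests.get() holds the event loop's thread until the response arrives, so every other coroutine has to wait for it.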
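
To make the difference visible without touching the network, here is a minimal sketch. It assumes Python 3.7+ (async/await and asyncio.run, rather than the @asyncio.coroutine style in the question). The names blocking_fetch, cooperative_fetch and run_all are made up for illustration; time.sleep stands in for a blocking call like requests.get, and asyncio.sleep stands in for a request that yields to the event loop.

```python
import asyncio
import time


async def blocking_fetch(name, log):
    # time.sleep stands in for requests.get: it never yields to the event loop,
    # so no other task can start until this one is finished.
    log.append('start ' + name)
    time.sleep(0.05)
    log.append('done ' + name)


async def cooperative_fetch(name, log):
    # asyncio.sleep stands in for a non-blocking request: awaiting it hands
    # control back to the event loop so other tasks can run.
    log.append('start ' + name)
    await asyncio.sleep(0.05)
    log.append('done ' + name)


async def run_all(fetch):
    # Schedule three "downloads" at once and wait for all of them.
    log = []
    await asyncio.gather(*(fetch(n, log) for n in ('a', 'b', 'c')))
    return log


blocking_log = asyncio.run(run_all(blocking_fetch))
cooperative_log = asyncio.run(run_all(cooperative_fetch))
print(blocking_log)      # each task starts and finishes before the next begins
print(cooperative_log)   # all three tasks start before any of them finishes
```

With the blocking version, the log alternates start/done for each task, which is the same pattern as the Getting/Got lines in the question's output. With the cooperative version, all three starts come first.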

There is an asynchronous library called aiohttp that supports asynchronous HTTP requests, although it is not as user-friendly as requests.
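
This sketch does not use aiohttp. It shows a different, standard-library-only technique: handing the blocking call to a worker thread with loop.run_in_executor, so the event loop stays free. It also schedules all child pages together with asyncio.gather, because awaiting them one at a time (as the yield from inside the question's for loop does) would still serialize the downloads. Everything here is an assumption for illustration: SITE is a hypothetical in-memory link graph, blocking_get is a stand-in for requests.get, and Python 3.7+ is assumed.

```python
import asyncio
import time

# Hypothetical in-memory "site": page -> links found on it (stands in for real HTML).
SITE = {
    'index': ['a', 'b', 'c'],
    'a': [],
    'b': [],
    'c': [],
}


def blocking_get(url):
    # Stand-in for requests.get(url): blocks the calling thread for a while.
    time.sleep(0.1)
    return SITE[url]


async def get_page(url, seen):
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so the event loop stays free.
    links = await loop.run_in_executor(None, blocking_get, url)
    seen.append(url)  # appended on the event loop thread, after the await returns
    # Fan out: schedule every child page at once, then wait for all of them.
    await asyncio.gather(*(get_page(link, seen) for link in links))


async def main():
    seen = []
    start = time.monotonic()
    await get_page('index', seen)
    return seen, time.monotonic() - start


seen, elapsed = asyncio.run(main())
print(seen, round(elapsed, 2))
```

Downloading the four pages one after another would take at least 0.4 seconds here; with the three children running in parallel threads, the total should be closer to 0.2 seconds. The order in which a, b and c finish is not guaranteed.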
