Adding tasks to python asyncio

I am trying to write a simple web crawler to test how the new asyncio module works, but I'm getting something wrong. I want to start the crawler with a single URL. The script should download that page, find any <a> tags on it, and schedule those links to be downloaded too. I expect the output to show that the first page has been downloaded, followed by the subsequent pages in arbitrary order (i.e. as each download finishes) until all are done. Instead, the pages appear to be downloaded sequentially. I'm completely new to async in general and to this module specifically, so I'm sure I'm missing some fundamental concepts.

Here is my code so far:

import asyncio
import re
import requests
import time
from bs4 import BeautifulSoup
from functools import partial

@asyncio.coroutine
def get_page(url, depth=0):
    print('%s: Getting %s' % (time.time(), url))
    page = requests.get(url)
    print('%s: Got %s' % (time.time(), url))
    soup = BeautifulSoup(page.text)
    if depth < 2:
        for a in soup.find_all('a', href=re.compile(r'\w+\.html'))[:3]:
            u = 'https://docs.python.org/3/' + a['href']
            print('%s: Scheduling %s' % (time.time(), u))
            yield from get_page(u, depth+1)
    if depth == 0:
        loop.stop()
    return soup

root = 'https://docs.python.org/3/'
loop = asyncio.get_event_loop()
loop.create_task(get_page(root))
loop.run_forever()

And here is the output:

1434971882.3458219: Getting https://docs.python.org/3/
1434971893.0054126: Got https://docs.python.org/3/
1434971893.015218: Scheduling https://docs.python.org/3/genindex.html
1434971893.0153584: Getting https://docs.python.org/3/genindex.html
1434971894.464993: Got https://docs.python.org/3/genindex.html
1434971894.4752269: Scheduling https://docs.python.org/3/py-modindex.html
1434971894.4753256: Getting https://docs.python.org/3/py-modindex.html
1434971896.9845033: Got https://docs.python.org/3/py-modindex.html
1434971897.0756354: Scheduling https://docs.python.org/3/index.html
1434971897.0757186: Getting https://docs.python.org/3/index.html
1434971907.451529: Got https://docs.python.org/3/index.html
1434971907.4600112: Scheduling https://docs.python.org/3/genindex-Symbols.html
1434971907.4600625: Getting https://docs.python.org/3/genindex-Symbols.html
1434971917.6517148: Got https://docs.python.org/3/genindex-Symbols.html
1434971917.6789174: Scheduling https://docs.python.org/3/py-modindex.html
1434971917.6789672: Getting https://docs.python.org/3/py-modindex.html
1434971919.454042: Got https://docs.python.org/3/py-modindex.html
1434971919.574361: Scheduling https://docs.python.org/3/genindex.html
1434971919.574434: Getting https://docs.python.org/3/genindex.html
1434971920.5942516: Got https://docs.python.org/3/genindex.html
1434971920.6020699: Scheduling https://docs.python.org/3/index.html
1434971920.6021295: Getting https://docs.python.org/3/index.html
1434971922.1504402: Got https://docs.python.org/3/index.html
1434971922.1589775: Scheduling https://docs.python.org/3/library/__future__.html#module-__future__
1434971922.1590302: Getting https://docs.python.org/3/library/__future__.html#module-__future__
1434971923.30988: Got https://docs.python.org/3/library/__future__.html#module-__future__
1434971923.3215268: Scheduling https://docs.python.org/3/whatsnew/3.4.html
1434971923.321574: Getting https://docs.python.org/3/whatsnew/3.4.html
1434971926.6502898: Got https://docs.python.org/3/whatsnew/3.4.html
1434971926.89331: Scheduling https://docs.python.org/3/../genindex.html
1434971926.8934016: Getting https://docs.python.org/3/../genindex.html
1434971929.0996494: Got https://docs.python.org/3/../genindex.html
1434971929.1068246: Scheduling https://docs.python.org/3/../py-modindex.html
1434971929.1068716: Getting https://docs.python.org/3/../py-modindex.html
1434971932.5949798: Got https://docs.python.org/3/../py-modindex.html
1434971932.717457: Scheduling https://docs.python.org/3/3.3.html
1434971932.7175465: Getting https://docs.python.org/3/3.3.html
1434971934.009238: Got https://docs.python.org/3/3.3.html

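Using asyncio doesn't magically make all your code asynchronous. In this case, requests.get is a blocking call: it holds the event loop until the response arrives, so no other coroutine can run in the meantime and every download waits for the previous one.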
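
To illustrate the difference, here is a minimal sketch that uses no network access at all. It assumes that time.sleep is a fair stand-in for a blocking call such as requests.get and that asyncio.sleep is a fair stand-in for a truly asynchronous request. The function names are hypothetical, and the sketch uses the async/await syntax of newer Python versions rather than @asyncio.coroutine and yield from.

```python
import asyncio
import time

async def blocking_fetch(delay):
    # time.sleep stands in for a blocking call such as requests.get (assumption):
    # it never hands control back to the event loop.
    time.sleep(delay)

async def cooperative_fetch(delay):
    # asyncio.sleep stands in for a non-blocking request (assumption):
    # awaiting it lets other tasks run while this one waits.
    await asyncio.sleep(delay)

async def timed(fetch, n=3, delay=0.2):
    # Start n fetches together and measure how long they take in total.
    start = time.perf_counter()
    await asyncio.gather(*(fetch(delay) for _ in range(n)))
    return time.perf_counter() - start

async def main():
    blocking = await timed(blocking_fetch)
    cooperative = await timed(cooperative_fetch)
    return blocking, cooperative

blocking_elapsed, cooperative_elapsed = asyncio.run(main())
print('blocking: %.2fs, cooperative: %.2fs' % (blocking_elapsed, cooperative_elapsed))
```

Even though both versions are started through asyncio.gather, the blocking one should take roughly three times the delay, while the cooperative one should take roughly one delay, because its waits overlap.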

There is an asynchronous HTTP library called aiohttp that makes non-blocking requests, although it isn't as user-friendly as requests.
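
If you would rather keep a blocking library such as requests instead of switching to aiohttp, one possible approach is to push each blocking call into a thread pool with loop.run_in_executor and start the downloads together with asyncio.gather. This is only a sketch under stated assumptions: blocking_get is a hypothetical stand-in for requests.get that sleeps instead of touching the network, and the URLs are taken from the output above purely as labels.

```python
import asyncio
import time

def blocking_get(url):
    # Hypothetical stand-in for requests.get(url): sleeps instead of using the network.
    time.sleep(0.2)
    return 'body of %s' % url

async def get_page(url):
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so the event loop stays free.
    return await loop.run_in_executor(None, blocking_get, url)

async def crawl(urls):
    # gather starts all downloads at once instead of awaiting them one by one.
    return await asyncio.gather(*(get_page(u) for u in urls))

urls = [
    'https://docs.python.org/3/genindex.html',
    'https://docs.python.org/3/py-modindex.html',
    'https://docs.python.org/3/index.html',
]
start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print('%d pages in %.2fs' % (len(pages), elapsed))
```

Note that the original loop uses yield from get_page(...) for each link in turn, which waits for one page to finish before starting the next. Collecting the coroutines and awaiting them together, as crawl does here, is what lets the waits overlap.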


 