
Web Scraping with Python in combination with asyncio

I've written a script in Python to get some information from a webpage. The code itself runs flawlessly when it is taken out of asyncio. However, since my script runs synchronously, I wanted to make it go through an asynchronous process so that it accomplishes the task in the shortest possible time, with optimum performance and, obviously, in a non-blocking manner. As I have never worked with the asyncio library before, I'm seriously confused about how to make it work. I've tried to fit my script into the asyncio flow, but it doesn't seem right. If somebody lends a helping hand to complete this, I would be really grateful. Thanks in advance. Here is my erroneous code:

import requests
from lxml import html
import asyncio

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)

    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()

Upon execution what I see in the console is:

RuntimeWarning: coroutine 'processing_docs' was never awaited
  processing_docs(base_link + titles.attrib['href'])

You need to call processing_docs() with await.

Replace:

processing_docs(base_link + titles.attrib['href'])

with:

await processing_docs(base_link + titles.attrib['href'])

And replace:

processing_docs(page_link)

with:

await processing_docs(page_link)

Otherwise calling the coroutine merely creates a coroutine object that never gets scheduled to run, which is exactly what the RuntimeWarning is complaining about.
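
Putting both changes together, the full script with the two await calls added (and otherwise unchanged) looks like this:

import requests
from lxml import html
import asyncio

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        # await actually runs the coroutine instead of just creating it
        await processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)

    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        # same fix for the recursive call into the next page
        await processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()

Be aware that this only fixes the warning: requests.get() is a blocking call, so the pages are still fetched one after another and the event loop is blocked while each request is in flight. If you actually want the speed-up you describe, you need a non-blocking HTTP client. Below is a rough sketch of the same scraper rewritten with aiohttp (a separate package, pip install aiohttp) and asyncio.gather(), so that all tag pages are fetched concurrently; treat it as a starting point rather than a drop-in replacement:

import asyncio
import aiohttp
from lxml import html

link = "http://quotes.toscrape.com/"

async def fetch(session, url):
    # non-blocking GET; the event loop stays free while the request is in flight
    async with session.get(url) as response:
        return await response.text()

async def processing_docs(session, page_link):
    root = html.fromstring(await fetch(session, page_link))
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")
    if next_page:
        # pages within one tag must stay sequential, since each "next"
        # link only appears on the previous page
        await processing_docs(session, link + next_page[0].attrib['href'])

async def quotes_scraper(base_link):
    async with aiohttp.ClientSession() as session:
        tree = html.fromstring(await fetch(session, base_link))
        tag_links = [base_link + titles.attrib['href']
                     for titles in tree.cssselect("span.tag-item a.tag")]
        # crawl every tag concurrently instead of one after another
        await asyncio.gather(*(processing_docs(session, url) for url in tag_links))

asyncio.run(quotes_scraper(link))

On Python 3.7+, asyncio.run(quotes_scraper(link)) as used above also replaces the three get_event_loop() / run_until_complete() / close() lines from the original script.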
