Web Scraping with Python and asyncio

Question

I've written a script in Python to get some information from a webpage. The code itself runs flawlessly when it is taken out of asyncio. However, since my script runs synchronously, I wanted to make it asynchronous so that it completes the task in the shortest possible time and does not block. As I have never worked with the asyncio library, I'm seriously confused about how to make it work. I've tried to fit my script into asyncio, but it doesn't seem right. If somebody could lend a helping hand to complete this, I would be really grateful. Thanks in advance. Here is my erroneous code:

import requests ; from lxml import html
import asyncio

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
        response = requests.get(base_link)
        tree = html.fromstring(response.text)
        for titles in tree.cssselect("span.tag-item a.tag"):
            processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
        response = requests.get(base_link).text
        root = html.fromstring(response)
        for soups in root.cssselect("div.quote"):
            quote = soups.cssselect("span.text")[0].text
            author = soups.cssselect("small.author")[0].text
            print(quote, author)


        next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
        if next_page:
            page_link = link + next_page
            processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()

Upon execution, what I see in the console is:

RuntimeWarning: coroutine 'processing_docs' was never awaited
  processing_docs(base_link + titles.attrib['href'])

Answer 1

You need to call processing_docs() with await.

Replace:

processing_docs(base_link + titles.attrib['href'])

with:

await processing_docs(base_link + titles.attrib['href'])

And replace:

processing_docs(page_link)

with:

await processing_docs(page_link)

Otherwise, calling processing_docs() only creates a coroutine object that never runs, which is exactly what the RuntimeWarning is telling you.
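One caveat worth adding: with await in place the warning goes away, but requests.get() is still a blocking call, so the pages are fetched one after another. Below is a minimal offline sketch of one possible way to get actual overlap, assuming Python 3.9+ for asyncio.to_thread. The fetch() helper is a hypothetical stand-in for requests.get(url).text, and the tag URLs are only illustrative; nothing here touches the network.

```python
import asyncio
import time


def fetch(url):
    # Hypothetical stand-in for requests.get(url).text.
    # time.sleep mimics a blocking network call.
    time.sleep(0.1)
    return f"<html>{url}</html>"


async def processing_docs(url):
    # asyncio.to_thread (Python 3.9+) runs the blocking call in a worker
    # thread, so the event loop is free to start the other fetches.
    return await asyncio.to_thread(fetch, url)


async def quotes_scraper(urls):
    # gather() awaits every coroutine, so none is left un-run, and it
    # returns the results in the same order as the inputs.
    return await asyncio.gather(*(processing_docs(u) for u in urls))


# Illustrative URLs only; fetch() never actually requests them.
urls = [f"http://quotes.toscrape.com/tag/{tag}/" for tag in ("love", "life", "humor")]
pages = asyncio.run(quotes_scraper(urls))
print(len(pages), "pages fetched")
```

In the real script, fetch() would wrap requests.get(), and the parsing with lxml could stay as it is. An async HTTP client would be another option, but that means replacing requests rather than wrapping it.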

Answer 1 (4, accepted), 2017-09-05 13:50:03
