
How to run tasks concurrently in asyncio?

I'm trying to learn how to use Python's asyncio module to run tasks concurrently. In the code below, I have a mocked-up "web crawler" as an example. Basically, I am trying to have at most two active fetch() requests happening at any given time, and I want process() to be called during the sleep() period.

import asyncio

class Crawler():

    urlq = ['http://www.google.com', 'http://www.yahoo.com', 
            'http://www.cnn.com', 'http://www.gamespot.com', 
            'http://www.facebook.com', 'http://www.evergreen.edu']

    htmlq = []
    MAX_ACTIVE_FETCHES = 2
    active_fetches = 0

    def __init__(self):
        pass

    async def fetch(self, url):
        self.active_fetches += 1
        print("Fetching URL: " + url);
        await(asyncio.sleep(2))
        self.active_fetches -= 1
        self.htmlq.append(url)

    async def crawl(self):
        while self.active_fetches < self.MAX_ACTIVE_FETCHES:
            if self.urlq:
                url = self.urlq.pop()
                task = asyncio.create_task(self.fetch(url))
                await task
            else:
                print("URL queue empty")
                break;

    def process(self, page):
        print("processed page: " + page)

# main loop

c = Crawler()
while(c.urlq):
    asyncio.run(c.crawl())
    while c.htmlq:
        page = c.htmlq.pop()
        c.process(page)

However, the code above downloads the URLs one by one (not two at a time) and doesn't do any "processing" until all URLs have been fetched. How can I make the fetch() tasks run concurrently, and how can I make process() get called during sleep()?

Your crawl method waits after each individual task; you should change it to this:

async def crawl(self):
    tasks = []
    while self.active_fetches < self.MAX_ACTIVE_FETCHES:
        if self.urlq:
            url = self.urlq.pop()
            tasks.append(asyncio.create_task(self.fetch(url)))
        else:
            break  # queue is empty; stop creating new tasks
    await asyncio.gather(*tasks)
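
For context, here is a minimal standalone sketch (not part of the original answer; the URL strings and the 2-second delay are placeholders) contrasting the two patterns: awaiting each task as soon as it is created runs the fetches back to back, while creating all the tasks first and awaiting asyncio.gather() lets them sleep at the same time:

import asyncio
import time

async def fake_fetch(url):
    # stands in for a network request
    await asyncio.sleep(2)
    return url

async def sequential(urls):
    # each await blocks until that task finishes, so the fetches run one at a time
    for url in urls:
        await asyncio.create_task(fake_fetch(url))

async def concurrent(urls):
    # create all the tasks first, then wait for all of them together
    tasks = [asyncio.create_task(fake_fetch(url)) for url in urls]
    await asyncio.gather(*tasks)

urls = ['http://www.google.com', 'http://www.yahoo.com']

start = time.perf_counter()
asyncio.run(sequential(urls))
print(f"sequential took {time.perf_counter() - start:.1f}s")  # roughly 4 seconds

start = time.perf_counter()
asyncio.run(concurrent(urls))
print(f"concurrent took {time.perf_counter() - start:.1f}s")  # roughly 2 seconds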

EDIT: Here is a cleaner version, with comments, that fetches and processes everything concurrently, while preserving the basic ability to put a cap on the maximum number of fetchers.

import asyncio

class Crawler:

    def __init__(self, urls, max_workers=2):
        self.urls = urls
        # queue of URLs waiting to be fetched; the number of concurrent
        # fetches is capped by the number of worker coroutines (max_workers)
        self.fetching = asyncio.Queue()
        self.max_workers = max_workers

    async def crawl(self):
        # DON'T await here; start consuming things out of the queue, and
        # meanwhile execution of this function continues. We'll start
        # max_workers worker coroutines that each fetch and then process.
        all_the_coros = asyncio.gather(
            *[self._worker(i) for i in range(self.max_workers)])

        # place all URLs on the queue
        for url in self.urls:
            await self.fetching.put(url)

        # now put a bunch of `None`'s in the queue as signals to the workers
        # that there are no more items in the queue.
        for _ in range(self.max_workers):
            await self.fetching.put(None)

        # now make sure everything is done
        await all_the_coros

    async def _worker(self, i):
        while True:
            url = await self.fetching.get()
            if url is None:
                # this coroutine is done; simply return to exit
                return

            print(f'Fetch worker {i} is fetching a URL: {url}')
            page = await self.fetch(url)
            self.process(page)

    async def fetch(self, url):
        print("Fetching URL: " + url);
        await asyncio.sleep(2)
        return f"the contents of {url}"

    def process(self, page):
        print("processed page: " + page)


# main loop
c = Crawler(['http://www.google.com', 'http://www.yahoo.com', 
             'http://www.cnn.com', 'http://www.gamespot.com', 
             'http://www.facebook.com', 'http://www.evergreen.edu'])
asyncio.run(c.crawl())
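
As a side note, another common way to cap the number of concurrent fetches (not used in the answer above; the function names here are illustrative) is an asyncio.Semaphore, which lets you create one task per URL up front while still limiting how many fetches are in flight at once:

import asyncio

async def fetch_and_process(url, sem):
    # the semaphore allows at most max_fetches fetches to run at the same time
    async with sem:
        print("Fetching URL: " + url)
        await asyncio.sleep(2)
        page = f"the contents of {url}"
    # processing happens as soon as this URL is fetched, outside the semaphore
    print("processed page: " + page)

async def main(urls, max_fetches=2):
    sem = asyncio.Semaphore(max_fetches)
    await asyncio.gather(*(fetch_and_process(url, sem) for url in urls))

asyncio.run(main(['http://www.google.com', 'http://www.yahoo.com',
                  'http://www.cnn.com', 'http://www.gamespot.com']))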

You can make htmlq an asyncio.Queue(), and change htmlq.append to await htmlq.put(...). Then your main can be async, like this:

async def main():
    c = Crawler()
    asyncio.create_task(c.crawl())
    while True:
        page = await c.htmlq.get()
        if page is None:
            break
        c.process(page)

Your top-level code then boils down to a call to asyncio.run(main()).

When it is done crawling, crawl() can enqueue None to notify the main coroutine that the work is done.
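
Putting those suggestions together, a minimal sketch of the modified Crawler from the question might look like this (main() is reused from the answer above; the Crawler changes and the None sentinel follow the description and are a sketch, not code taken from the answer itself):

import asyncio

class Crawler:

    urlq = ['http://www.google.com', 'http://www.yahoo.com',
            'http://www.cnn.com', 'http://www.gamespot.com',
            'http://www.facebook.com', 'http://www.evergreen.edu']

    def __init__(self):
        # fetched pages now go into an asyncio.Queue instead of a list
        self.htmlq = asyncio.Queue()

    async def fetch(self, url):
        print("Fetching URL: " + url)
        await asyncio.sleep(2)
        await self.htmlq.put(url)   # was: self.htmlq.append(url)

    async def crawl(self):
        tasks = [asyncio.create_task(self.fetch(url)) for url in self.urlq]
        await asyncio.gather(*tasks)
        await self.htmlq.put(None)  # sentinel: tell main() the work is done

    def process(self, page):
        print("processed page: " + page)

# main() as given in the answer above
async def main():
    c = Crawler()
    asyncio.create_task(c.crawl())
    while True:
        page = await c.htmlq.get()
        if page is None:
            break
        c.process(page)

asyncio.run(main())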
