Python and Trio, where producers are consumers, how to exit gracefully when the job is done?

I'm trying to make a simple web crawler using trio and asks. I use a nursery to start a couple of crawlers at once, and a memory channel to maintain the list of urls to visit.

Each crawler receives clones of both ends of that channel, so it can grab a url (via receive_channel), read it, then find and add new urls to be visited (via send_channel).

async def main():
    send_channel, receive_channel = trio.open_memory_channel(math.inf)
    async with trio.open_nursery() as nursery:
        async with send_channel, receive_channel:
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())
            nursery.start_soon(crawler, send_channel.clone(), receive_channel.clone())


async def crawler(send_channel, receive_channel):
    async for url in receive_channel:  # I'm a consumer!
        content = await ...
        urls_found = ...
        for u in urls_found:
            await send_channel.send(u)  # I'm a producer too!

In this scenario the consumers are also the producers. How do I stop everything gracefully?

The conditions for shutting everything down are:

  • the channel is empty
  • AND
  • all crawlers are stuck at their first for loop, waiting for a url to appear in receive_channel (which... won't happen anymore)

I tried using async with send_channel inside crawler() but could not find a good way to do it. I also tried to find some different approach (some memory-channel-bound worker pool, etc.), with no luck there either.
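
For illustration, the async with attempt looked roughly like this (just a sketch, not my exact code; fetch_and_parse is a hypothetical stand-in for the asks call and the link extraction):

async def crawler(send_channel, receive_channel):
    # Hypothetical variant: close this crawler's channel clones when it exits.
    async with send_channel, receive_channel:
        async for url in receive_channel:
            urls_found = await fetch_and_parse(url)  # stand-in, not a real function
            for u in urls_found:
                await send_channel.send(u)
    # The problem: execution never gets here. The async for blocks forever
    # waiting for the next url, so the clones are never closed and the other
    # crawlers never see the channel end.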

There are at least two problems here.

First, there is your assumption about stopping when the channel is empty. Since you allocate the memory channel with a size of 0, it will always be empty. You are only able to hand off a url if a crawler is ready to receive it.

This creates problem number two. If you ever find more urls than you have allocated crawlers, your application will deadlock.

The reason is that, since you won't be able to hand off all your found urls to a crawler, the crawler will never be ready to receive a new url to crawl, because it is stuck waiting for another crawler to take one of its urls.

This gets even worse: if one of the other crawlers finds new urls, it too will get stuck behind the crawler that is already waiting to hand off its urls, and it will never be able to take one of the urls that are waiting to be processed.

Relevant portion of the documentation:

https://trio.readthedocs.io/en/stable/reference-core.html#buffering-in-channels
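
A minimal sketch of the difference between a zero-size and an unbounded buffer:

import math
import trio

async def demo():
    # Zero-capacity channel: a send can only complete if a receiver is
    # already waiting, so send_nowait raises WouldBlock here.
    send, recv = trio.open_memory_channel(0)
    try:
        send.send_nowait("https://example.com")
    except trio.WouldBlock:
        print("no receiver ready -> a plain send() would block")

    # Unbounded channel: sends always complete immediately, so a producer
    # can never end up stuck waiting for another crawler to take its urls.
    send, recv = trio.open_memory_channel(math.inf)
    send.send_nowait("https://example.com")
    print(recv.receive_nowait())

trio.run(demo)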

Assuming we fix that, where to go next?

You probably need to keep a list (set?) of all visited urls, to make sure you don't visit them again.
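
For example (a sketch; the shared visited set is an assumption, and it needs no locking because all crawlers run in the same Trio event loop):

visited = set()  # created in main() and passed to every crawler

async def crawler(active_workers, visited, send_channel, receive_channel):
    async for url in receive_channel:
        if url in visited:
            continue         # already crawled, skip it
        visited.add(url)
        ...                  # fetch, parse and send new urls as before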

To actually figure out when to stop, instead of closing the channels, it is probably a lot easier to simply cancel the nursery.

Let's say we modify the main loop like this:

async def main():
    send_channel, receive_channel = trio.open_memory_channel(math.inf)
    active_workers = trio.CapacityLimiter(3) # Number of workers
    async with trio.open_nursery() as nursery:
        async with send_channel, receive_channel:
            nursery.start_soon(crawler, active_workers, send_channel, receive_channel)
            nursery.start_soon(crawler, active_workers, send_channel, receive_channel)
            nursery.start_soon(crawler, active_workers, send_channel, receive_channel)
            while True:
                await trio.sleep(1) # Give the workers a chance to start up.
                if active_workers.borrowed_tokens == 0 and send_channel.statistics().current_buffer_used == 0:
                    nursery.cancel_scope.cancel() # All done!

Now we need to modify the crawlers slightly, to pick up a token when active.

async def crawler(active_workers, send_channel, receive_channel):
    async for url in receive_channel:  # I'm a consumer!
        async with active_workers:
            content = await ...
            urls_found = ...
            for u in urls_found:
                await send_channel.send(u)  # I'm a producer too!

Other things to consider -

You may want to use send_channel.send_nowait(u) in the crawler. Since you have an unbounded buffer, there is no chance of a WouldBlock exception, and the behaviour of not having a checkpoint trigger on every send might be desirable. That way you know for sure that a particular url is fully processed and all new urls have been added, before other tasks get a chance to grab a new url, or the parent task gets a chance to check whether the work is done.
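
In the crawler above, that would mean replacing the send loop with something like this (a sketch):

            for u in urls_found:
                # The buffer is unbounded, so this never raises trio.WouldBlock,
                # and unlike send() it is not a checkpoint: no other task can run
                # until every found url has been queued.
                send_channel.send_nowait(u)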

This is a solution I came up with when I tried to reorganize the problem:

async def main():
    send_channel, receive_channel = trio.open_memory_channel(math.inf)
 
    limit = trio.CapacityLimiter(3)

    async with send_channel:
        await send_channel.send(('https://start-url', send_channel.clone()))
    #HERE1

    async with trio.open_nursery() as nursery:
        async for url, send_channel in receive_channel:  #HERE3
            nursery.start_soon(crawler, url, send_channel, limit)

async def crawler(url, send_channel, limit):
    async with limit, send_channel:
        content = await ...
        links = ...
        for link in links:
            await send_channel.send((link, send_channel.clone()))
    #HERE2

(I left out the skipping of already-visited urls.)

Here there are no 3 long-lived consumers; instead, there are at most 3 consumers running whenever there is enough work for them.

At #HERE1 the send_channel is closed (because it was used as a context manager); the only thing keeping the channel alive is a clone of it, sitting inside that same channel.

At #HERE2 the clone is also closed (again because of the context manager). If the channel is empty, then that clone was the last thing keeping the channel alive. The channel dies, and the for loop ends (#HERE3).

UNLESS there were urls found, in which case they were added to the channel, together with more clones of send_channel, which will keep the channel alive long enough to process those urls.
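
Here is a stripped-down illustration of that clone-lifetime behaviour on its own, away from the crawler (a sketch):

import trio

async def clone_demo():
    send_channel, receive_channel = trio.open_memory_channel(10)

    async with send_channel:
        # Ship a clone along with the payload; from now on, that clone is
        # the only thing keeping the channel open.
        await send_channel.send(("item-1", send_channel.clone()))

    async for item, clone in receive_channel:
        print("got", item)
        # Close the clone without sending anything new. Once the last clone
        # is closed and the buffer is empty, the async for ends by itself.
        await clone.aclose()
    print("channel closed, loop ended")

trio.run(clone_demo)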

Both this and Anders E. Andersen's solutions feel hacky to me: one uses sleep and statistics(), the other creates clones of send_channel and puts them into the channel itself... it feels like a software implementation of a Klein bottle to me. I will probably look for some other approach.
