How to use Python multiprocessing Pool to consume items from queue forever

I'm trying to create a worker that listens for http requests and adds job IDs to a queue. I'm using Python's built-in multiprocessing module for that.

I need a Pool with a few processes that will process jobs from the queue and respawn. The processes have to restart, because in some cases job processing can cause a memory leak. The pool should run forever, as items will be added to the queue dynamically.

The problem is that my pool does not respawn workers after they complete.

How can I use a pool to achieve this? I want it to run forever, consume items from the queue, and respawn the child process after every task.

from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from multiprocessing import Pool, SimpleQueue, current_process

queue = SimpleQueue()

def do_something(q):
    worker_id = current_process().pid
    print(f"Worker {worker_id} spawned")
    item_id = q.get()
    print(f"Worker {worker_id} received id: {item_id}")
    # long_term_operation_that_leaks_memory(item_id)
    # print(f"Worker {worker_id} completed id: {item_id}")

def main():
    with Pool(
        processes=2, initializer=do_something, initargs=(queue,), maxtasksperchild=1
    ):
        queue.put("a")
        queue.put("b")
        queue.put("c")
        server_address = ("", 8000)
        httpd = ThreadingHTTPServer(server_address, BaseHTTPRequestHandler)
        try:
            httpd.serve_forever()
        except (KeyboardInterrupt, SystemExit):
            pass

if __name__ == "__main__":
    main()

I tried with initializer and maxtasksperchild, but it does not work.

I know I can submit new tasks to a pool using map, but I don't have a map of the infinite possible future tasks. I think the initializer should be responsible for all new tasks, but I don't know how to force it to run forever and respawn.

In my code example the "c" item is never processed. Therefore, if I add http logic to put more items, it will not work either. Adding http logic to this code is not a necessary part of my question, but any tips are welcome.

Thanks!

Edit:

The reason I decided to use Pool in this case is that the official documentation says:

Worker processes within a Pool typically live for the complete duration of the Pool's work queue. A frequent pattern found in other systems (such as Apache, mod_wsgi, etc) to free resources held by workers is to allow a worker within a pool to complete only a set amount of work before being exiting, being cleaned up and a new process spawned to replace the old one. The maxtasksperchild argument to the Pool exposes this ability to the end user.
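For reference, a minimal sketch of the pattern the documentation describes, with tasks submitted through the pool's own task queue (the task function is illustrative; chunksize=1 makes each item count as a separate task):

from multiprocessing import Pool, current_process

def task(item_id):
    # each print shows which worker pid handled the item
    print(f"Worker {current_process().pid} handled id: {item_id}")

if __name__ == "__main__":
    # with maxtasksperchild=1, every worker exits after one task
    # and the pool spawns a fresh process to replace it
    with Pool(processes=2, maxtasksperchild=1) as pool:
        pool.map(task, ["a", "b", "c", "d"], chunksize=1)

Running this should print more than two distinct pids, showing workers being replaced after each task.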

My goals:

  • Items will be added dynamically to the queue by http requests
  • Pool will live forever
  • Worker process will perform only one task from queue and will be respawned

Why did I use only 2 processes?

The number of processes will not be infinite, and it is easier to test my example with 2 processes rather than 5 or 10.

Why did I put 3 items in manually? It is for example purposes; in the real solution all items will be added dynamically, so there is no way to loop over them or to use map on them.

What you are doing with your pool initializer is most unusual. Such an initializer is run once in each pool process and is used to initialize that process (for example, setting global variables) so that it is able to run tasks that are submitted. A multiprocessing pool implements a hidden task queue for holding submitted tasks waiting to be processed by an available pool process. Your initializer code is only capable of executing a single quasi-task (I reserve the term task for work submitted to the processing pool in the "normal" way) and then it returns. That is, you are putting 3 items on the queue, yet you only have 2 pool processes, each getting a single item from the queue, processing it and then returning. This does not make any sense to me.
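For contrast, here is a minimal sketch of the usual role of an initializer: setting per-process state that normally submitted tasks can then rely on (the init_worker, handle, and _config names are illustrative):

from multiprocessing import Pool

_config = None  # per-process global, populated once per worker

def init_worker(config):
    global _config
    _config = config  # runs once in every pool process at startup

def handle(item_id):
    # a "normal" submitted task can rely on the state set by the initializer
    return f"{item_id} handled with {_config}"

if __name__ == "__main__":
    with Pool(processes=2, initializer=init_worker, initargs=({"verbose": True},)) as pool:
        print(pool.map(handle, ["a", "b", "c"]))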

Your code doesn't show the relationship between your HTTP server and the tasks running in your multiprocessing pool, and I will not guess what that may be. So I will only show the more standard way of using a pool. I have removed the maxtasksperchild argument because it is only relevant when your pool is executing "normal" tasks that are added to the task queue, for example, using the apply_async or map methods. Thus it was not accomplishing anything in your code.

from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from multiprocessing import Pool, current_process


def do_something(item_id):
    worker_id = current_process().pid
    print(f"Worker {worker_id} received id: {item_id}")
    # long_term_operation_that_leaks_memory(item_id)
    # print(f"Worker {worker_id} completed id: {item_id}")

def main():
    # Why only 2 processes in the pool?:
    pool = Pool(processes=2)
    pool.apply_async(do_something, args=('a',))
    pool.apply_async(do_something, args=('b',))
    pool.apply_async(do_something, args=('c',))
    server_address = ("", 8000)
    httpd = ThreadingHTTPServer(server_address, BaseHTTPRequestHandler)
    try:
        httpd.serve_forever()
    except (KeyboardInterrupt, SystemExit):
        pass
    # Wait for submitted tasks to complete:
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()

Prints:

Worker 15560 received id: a
Worker 8132 received id: b
Worker 15560 received id: c
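If the one-task-per-worker respawning from the question is still wanted, it can be layered onto this standard pattern: once items flow through the pool's task queue as normal tasks, maxtasksperchild takes effect again. A sketch, assuming a small dispatcher thread that drains a shared queue and uses None as a shutdown sentinel (both are illustrative choices):

from multiprocessing import Pool, current_process
from queue import Queue      # a thread queue suffices: apply_async is called from this process
from threading import Thread

def do_something(item_id):
    print(f"Worker {current_process().pid} received id: {item_id}")

def dispatcher(q, pool):
    # drain the queue forever, handing each item to the pool as a normal task
    while True:
        item_id = q.get()
        if item_id is None:  # shutdown sentinel
            break
        pool.apply_async(do_something, args=(item_id,))

if __name__ == "__main__":
    q = Queue()
    # maxtasksperchild=1 now applies: each worker exits after one task
    # and is replaced by a freshly spawned process
    with Pool(processes=2, maxtasksperchild=1) as pool:
        t = Thread(target=dispatcher, args=(q, pool))
        t.start()
        for item in ("a", "b", "c"):  # an HTTP handler would call q.put(...) instead
            q.put(item)
        q.put(None)
        t.join()       # wait until everything has been submitted
        pool.close()   # then let the submitted tasks finish
        pool.join()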

It seems to me like maybe you don't really need a pool here, and can maybe just create a new Process for each task. If you want to limit how many tasks exist at once, you can use a Semaphore to limit process creation, and release that semaphore just before each task completes:

from multiprocessing import Process, BoundedSemaphore
from time import sleep

def do_work(A, B):
    sleep(.4)
    print(A, B)

def worker(sema, *args):
    try:
        do_work(*args)
    finally:
        sema.release() #allow a new process to be started now that this one is exiting

def main():
    tasks = zip(range(65,91), bytes(range(65,91)).decode())
    sema = BoundedSemaphore(4)  # only ever 4 workers at a time
    procs = []
    for arglist in tasks:
        sema.acquire() #wait to start until another process is finished
        procs.append(Process(target=worker, args=(sema, *arglist)))
        procs[-1].start()

        # cleanup completed processes (guard against an empty list)
        while procs and not procs[0].is_alive():
            procs.pop(0)
    for p in procs:
        p.join() #wait for any remaining tasks
    print("done")

if __name__ == "__main__":
    main()
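One way this pattern might be adapted to the question's dynamically arriving items is to feed it from a queue instead of a fixed iterable; process_item and the None sentinel below are illustrative stand-ins:

from multiprocessing import Process, BoundedSemaphore, SimpleQueue

def process_item(item_id):
    # stand-in for long_term_operation_that_leaks_memory(item_id)
    print(f"processing {item_id}")

def worker(sema, item_id):
    try:
        process_item(item_id)
    finally:
        sema.release()  # free a slot so the dispatcher can start a replacement

def dispatcher(queue):
    sema = BoundedSemaphore(4)  # at most 4 live worker processes
    procs = []
    while True:
        item_id = queue.get()   # blocks until a producer (e.g. an HTTP handler) puts an item
        if item_id is None:     # shutdown sentinel
            break
        sema.acquire()          # wait for a free slot
        p = Process(target=worker, args=(sema, item_id))
        p.start()
        procs.append(p)
        procs = [p for p in procs if p.is_alive()]  # prune finished processes
    for p in procs:
        p.join()

if __name__ == "__main__":
    q = SimpleQueue()
    for item in ("a", "b", "c"):
        q.put(item)
    q.put(None)  # stop after the demo items
    dispatcher(q)

Because every item gets a brand-new Process, any memory leaked while processing is reclaimed when that process exits, which matches the respawn-per-task goal.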
