
Python JoinableQueue: task_done in another process needs to be called twice

I have implemented a WorkerManager based on multiprocessing.Process and JoinableQueue. After proc.join(timeout) I try to handle process failures such as timeouts or unhandled exceptions, evaluating proc.exitcode to determine how to handle each case and then calling in_queue.task_done() to notify the queue that the job was completed by the exception-handling logic. However, task_done() has to be invoked twice, and I have no idea why. Can anyone figure out the reason?

The whole code snippet:

# -*- coding=utf-8 -*-

import time
import threading
from queue import Empty
from multiprocessing import Event, Process, JoinableQueue, cpu_count, current_process

TIMEOUT = 3


class WorkersManager(object):

    def __init__(self, jobs, processes_num):
        self._processes_num = processes_num if processes_num else cpu_count()
        self._workers_num = processes_num
        self._in_queue, self._run_queue, self._out_queue = JoinableQueue(), JoinableQueue(), JoinableQueue()
        self._spawned_procs = []
        self._total = 0
        self._stop_event = Event()
        self._jobs_on_procs = {}

        self._wk_kwargs = dict(
            in_queue=self._in_queue, run_queue=self._run_queue, out_queue=self._out_queue,
            stop_event=self._stop_event
        )

        self._in_stream = [j for j in jobs]
        self._out_stream = []
        self._total = len(self._in_stream)

    def run(self):
        # Spawn Worker
        worker_processes = [
            WorkerProcess(i, **self._wk_kwargs) for i in range(self._processes_num)
        ]
        self._spawned_procs = [
            Process(target=process.run, args=tuple())
            for process in worker_processes
        ]

        for p in self._spawned_procs:
            p.start()

        self._serve()

        monitor = threading.Thread(target=self._monitor, args=tuple())
        monitor.start()

        collector = threading.Thread(target=self._collect, args=tuple())
        collector.start()

        self._join_workers()
        # TODO: Terminate threads
        monitor.join(TIMEOUT)
        collector.join(TIMEOUT)

        self._in_queue.join()
        self._out_queue.join()
        return self._out_stream

    def _join_workers(self):
        for p in self._spawned_procs:
            p.join(TIMEOUT)

            if p.is_alive():
                p.terminate()
                job = self._jobs_on_procs.get(p.name)
                print('Process TIMEOUT: {0} {1}'.format(p.name, job))
                result = {
                    "status": "failed"
                }

                self._out_queue.put(result)
                for _ in range(2):
                    # NOTE: Call task_done twice
                    # Guessing:
                    # 1st time to switch process?
                    # 2nd time to notify that the task is done?
                    # TODO: figure out why?
                    self._in_queue.task_done()
            else:
                if p.exitcode == 0:
                    print("{} exit with code:{}".format(p, p.exitcode))
                else:
                    job = self._jobs_on_procs.get(p.name)
                    if p.exitcode > 0:
                        print("{} with code:{} {}".format(p, p.exitcode, job))
                    else:
                        print("{} been killed with code:{} {}".format(p, p.exitcode, job))

                    result = {
                        "status": "failed"
                    }

                    self._out_queue.put(result)
                    for _ in range(2):
                        # NOTE: Call task_done twice
                        # Guessing:
                        # 1st time to switch process?
                        # 2nd time to notify that the task is done?
                        # TODO: figure out why?
                        self._in_queue.task_done()

    def _collect(self):
        # TODO: Spawn a collector proc
        while True:
            try:
                r = self._out_queue.get()
                self._out_stream.append(r)
                self._out_queue.task_done()

                if len(self._out_stream) >= self._total:
                    print("Total {} jobs done.".format(len(self._out_stream)))
                    self._stop_event.set()
                    break
            except Empty:
                continue

    def _serve(self):
        for job in self._in_stream:
            self._in_queue.put(job)

        for _ in range(self._workers_num):
            self._in_queue.put(None)

    def _monitor(self):
        running = 0
        while True:
            proc_name, job = self._run_queue.get()
            running += 1
            self._jobs_on_procs.update({proc_name: job})
            self._run_queue.task_done()
            if running == self._total:
                break


class WorkerProcess(object):

    def __init__(self, worker_id, in_queue, run_queue, out_queue, stop_event):
        self._worker_id = worker_id
        self._in_queue = in_queue
        self._run_queue = run_queue
        self._out_queue = out_queue
        self._stop_event = stop_event

    def run(self):
        self._work()
        print('worker - {} quit'.format(self._worker_id))

    def _work(self):
        print("worker - {0} start to work".format(self._worker_id))
        job = {}
        while not self._stop_event.is_set():
            try:
                job = self._in_queue.get(timeout=.01)
            except Empty:
                continue

            if not job:
                self._in_queue.task_done()
                break

            try:
                proc = current_process()
                self._run_queue.put((proc.name, job))
                r = self._run_job(job)
                self._out_queue.put(r)
            except Exception as err:
                print('Unhandled exception: {0}'.format(err))
                result = {"status": 'failed'}
                self._out_queue.put(result)
            finally:
                self._in_queue.task_done()

    def _run_job(self, job):
        time.sleep(job)
        return {
            'status': 'succeed'
        }


def main():

    jobs = [3, 4, 5, 6, 7]
    procs_num = 3
    m = WorkersManager(jobs, procs_num)
    m.run()


if __name__ == "__main__":
    main()

And the problematic code is as follows:

                    self._out_queue.put(result)
                    for _ in range(2):
                        # ISSUE HERE !!!
                        # NOTE: Call task_done twice
                        # Guessing:
                        # 1st time to switch process?
                        # 2nd time to notify that the task is done?
                        # TODO: figure out why?
                        self._in_queue.task_done()

I need to invoke self._in_queue.task_done() twice to notify the JoinableQueue that the job has been completed by the exception-handling logic.

My guess was that the first task_done() call might switch the process context, or something like that. According to my testing, it is the second task_done() that takes effect.

worker - 0 start to work
worker - 1 start to work
worker - 2 start to work

Process TIMEOUT: Process-1 5
Process TIMEOUT: Process-2 6
Process TIMEOUT: Process-3 7
Total 5 jobs done.

If you call task_done() only once, the program blocks forever and never finishes.

The problem is that you have a race condition, defined as:

A race condition arises in software when a computer program, to operate properly, depends on the sequence or timing of the program's processes or threads.

In method WorkerProcess._work, your main loop begins:

    while not self._stop_event.is_set():
        try:
            job = self._in_queue.get(timeout=.01)
        except Empty:
            continue

        if not job:
            self._in_queue.task_done()
            break

self._stop_event is being set by the _collect thread. Depending on where WorkerProcess._work is in its loop when this occurs, it can exit the loop leaving behind the None that was placed on the _in_queue to signify no more jobs. Clearly, this occurred twice here, for two of the processes, but it could happen for 0, 1, 2 or 3 of them.

The fix is to replace while not self._stop_event.is_set(): with while True: and to rely solely on finding None on the _in_queue to signify termination, as sketched below. This lets you remove those extra calls to task_done for the processes that completed normally (you actually only needed one extra call per successfully completed process, not the two you have).
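A minimal sketch of that change, keeping the rest of _work exactly as in the original (only the loop condition differs):

    while True:  # no longer watches self._stop_event
        try:
            job = self._in_queue.get(timeout=.01)
        except Empty:
            continue

        if not job:
            # the None sentinel is now the only termination signal
            self._in_queue.task_done()
            break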

But that is only half of the problem. The other half is that your code contains:

def _join_workers(self):
    for p in self._spawned_procs:
        p.join(TIMEOUT)
        ...
            p.terminate()

Therefore, you are not allowing your workers enough time to deplete the _in_queue, so an arbitrary number of messages may be left on it (in your example, of course, there would just be the current "job" being processed plus the None sentinel, for a total of 2).

But this is the problem with the code in general: it has been over-engineered. As an example, the first code snippet above can be further simplified to:

    while True:
        job = self._in_queue.get() # blocking get
        if not job:
            break

Moreover, there is no reason to even be using a JoinableQueue or Event instance, since a None sentinel placed on the _in_queue is sufficient to signal that the worker processes should terminate, especially if you are going to terminate the workers prematurely anyway. The simplified, working code is:

import time
import threading
from multiprocessing import Process, Queue, cpu_count, current_process

TIMEOUT = 3


class WorkersManager(object):

    def __init__(self, jobs, processes_num):
        self._processes_num = processes_num if processes_num else cpu_count()
        self._workers_num = processes_num
        self._in_queue, self._run_queue, self._out_queue = Queue(), Queue(), Queue()
        self._spawned_procs = []
        self._total = 0
        self._jobs_on_procs = {}

        self._wk_kwargs = dict(
            in_queue=self._in_queue, run_queue=self._run_queue, out_queue=self._out_queue
        )

        self._in_stream = [j for j in jobs]
        self._out_stream = []
        self._total = len(self._in_stream)

    def run(self):
        # Spawn Worker
        worker_processes = [
            WorkerProcess(i, **self._wk_kwargs) for i in range(self._processes_num)
        ]
        self._spawned_procs = [
            Process(target=process.run, args=tuple())
            for process in worker_processes
        ]

        for p in self._spawned_procs:
            p.start()

        self._serve()

        monitor = threading.Thread(target=self._monitor, args=tuple())
        monitor.start()

        collector = threading.Thread(target=self._collect, args=tuple())
        collector.start()

        self._join_workers()
        # TODO: Terminate threads
        monitor.join()
        collector.join()

        return self._out_stream

    def _join_workers(self):
        for p in self._spawned_procs:
            p.join(TIMEOUT)

            if p.is_alive():
                p.terminate()
                job = self._jobs_on_procs.get(p.name)
                print('Process TIMEOUT: {0} {1}'.format(p.name, job))
                result = {
                    "status": "failed"
                }

                self._out_queue.put(result)
            else:
                if p.exitcode == 0:
                    print("{} exit with code:{}".format(p, p.exitcode))
                else:
                    job = self._jobs_on_procs.get(p.name)
                    if p.exitcode > 0:
                        print("{} with code:{} {}".format(p, p.exitcode, job))
                    else:
                        print("{} been killed with code:{} {}".format(p, p.exitcode, job))

                    result = {
                        "status": "failed"
                    }

                    self._out_queue.put(result)

    def _collect(self):
        # TODO: Spawn a collector proc
        while True:
            r = self._out_queue.get()
            self._out_stream.append(r)
            if len(self._out_stream) >= self._total:
                print("Total {} jobs done.".format(len(self._out_stream)))
                break

    def _serve(self):
        for job in self._in_stream:
            self._in_queue.put(job)

        for _ in range(self._workers_num):
            self._in_queue.put(None)

    def _monitor(self):
        running = 0
        while True:
            proc_name, job = self._run_queue.get()
            running += 1
            self._jobs_on_procs.update({proc_name: job})
            if running == self._total:
                break


class WorkerProcess(object):

    def __init__(self, worker_id, in_queue, run_queue, out_queue):
        self._worker_id = worker_id
        self._in_queue = in_queue
        self._run_queue = run_queue
        self._out_queue = out_queue

    def run(self):
        self._work()
        print('worker - {} quit'.format(self._worker_id))

    def _work(self):
        print("worker - {0} start to work".format(self._worker_id))
        job = {}
        while True:
            job = self._in_queue.get()
            if not job:
                break

            try:
                proc = current_process()
                self._run_queue.put((proc.name, job))
                r = self._run_job(job)
                self._out_queue.put(r)
            except Exception as err:
                print('Unhandled exception: {0}'.format(err))
                result = {"status": 'failed'}
                self._out_queue.put(result)

    def _run_job(self, job):
        time.sleep(job)
        return {
            'status': 'succeed'
        }


def main():

    jobs = [3, 4, 5, 6, 7]
    procs_num = 3
    m = WorkersManager(jobs, procs_num)
    m.run()


if __name__ == "__main__":
    main()

Prints:

worker - 0 start to work
worker - 1 start to work
worker - 2 start to work
Process TIMEOUT: Process-1 3
Process TIMEOUT: Process-2 6
Process TIMEOUT: Process-3 7
Total 5 jobs done.

You are probably aware of this, but due diligence requires that I mention two excellent classes, multiprocessing.Pool and concurrent.futures.ProcessPoolExecutor, for doing what you want to accomplish. See this for some comparisons.
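For illustration, here is a minimal sketch of the same workload using concurrent.futures.ProcessPoolExecutor. The per-result timeout handling is my own assumption about how your TIMEOUT logic could map onto this API; note that, unlike terminate, the executor cannot kill a worker mid-task, so a timed-out job still runs to completion before the pool shuts down:

import time
from concurrent.futures import ProcessPoolExecutor, TimeoutError

TIMEOUT = 3

def run_job(job):
    time.sleep(job)
    return {'status': 'succeed'}

def main():
    jobs = [3, 4, 5, 6, 7]
    out_stream = []
    with ProcessPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(run_job, job) for job in jobs]
        for future in futures:
            try:
                # wait at most TIMEOUT seconds for this result
                out_stream.append(future.result(timeout=TIMEOUT))
            except TimeoutError:
                out_stream.append({'status': 'failed'})
    # leaving the with-block waits for any still-running jobs
    print("Total {} jobs done.".format(len(out_stream)))
    return out_stream

if __name__ == '__main__':
    main()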

Further Explanation

What is the point of using a JoinableQueue, which supports calls to task_done? Usually, it is so that you can be sure that all of the messages you placed on the queue have been taken off and processed, and so that the main process will not terminate prematurely before that has occurred. But this could not work correctly in your code, because you were giving each process only TIMEOUT seconds to handle its messages and then terminating it if it was still alive, with the possibility that messages were still left on its queue. This is what forced you to artificially issue extra calls to task_done just so your join calls on the queues in the main process would not hang, and it is why you had to post this question in the first place.
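To make those semantics concrete, here is a standalone sketch (independent of the code above): q.join() returns only after task_done() has been called once for every item that was put on the queue:

from multiprocessing import JoinableQueue, Process

def consume(q):
    while True:
        item = q.get()
        if item is None:
            q.task_done()   # the sentinel counts as an item too
            break
        # ... process item here ...
        q.task_done()       # exactly one task_done per successful get

if __name__ == '__main__':
    q = JoinableQueue()
    for i in range(3):
        q.put(i)
    q.put(None)  # sentinel
    p = Process(target=consume, args=(q,))
    p.start()
    q.join()     # unblocks only after 4 task_done calls (3 items + sentinel)
    p.join()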

So there are two ways you could have proceeded differently. One way would have allowed you to continue using JoinableQueue instances and calling join on them to know when to terminate. But then (1) you would not be able to prematurely terminate your worker processes, and (2) your worker processes would have to handle exceptions correctly so that they do not terminate prematurely without emptying their queues.

The other way is what I proposed, which is much simpler. The main process simply places a special sentinel message on the input queue, in this case None. This is just a message that cannot be mistaken for an actual message to be processed; it signifies end of file or, in other words, it signals the worker process that no more messages will be placed on the queue and that it may now terminate. Thus, in addition to the "real" messages, the main process only has to place the sentinel messages on the queue, and then, instead of calling join on the message queues (which are now regular, non-joinable queues), it calls join(TIMEOUT) on each process instance. Either you will find the process to be no longer alive because it has seen the sentinel, in which case you know it has processed all of its messages, or you can call terminate on the process if you are willing to leave messages on its input queue.

Of course, to be really sure that the processes that terminated on their own really did empty their queues, you might want to check those queues and verify that they are indeed empty. But I assume you should be able to code your processes to handle exceptions correctly, at least those that can be handled, so that they do not terminate prematurely and do something "reasonable" with every message.
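As a sketch of what such a check might look like (drain is my own illustrative helper, not part of the code above; a multiprocessing queue is only approximately inspectable, so the best-effort approach is to pull items until Empty is raised):

from queue import Empty

def drain(q):
    # Best-effort drain of a multiprocessing queue: collect whatever
    # is still on it and return the leftover items.
    leftovers = []
    while True:
        try:
            leftovers.append(q.get_nowait())
        except Empty:
            break
    return leftovers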
