Python Multiprocessing: Adding to Queue Within Child Process

I want to implement a file crawler that stores data to Mongo. I would like to use multiprocessing as a way to "hand off" blocking tasks such as unzipping files, crawling files, and uploading to Mongo. Certain tasks depend on other tasks (i.e., a file needs to be unzipped before it can be crawled), so I want the ability to complete the prerequisite task and add new tasks to the same task queue.

Here is what I currently have:

import multiprocessing


class Worker(multiprocessing.Process):
    def __init__(self, task_queue: multiprocessing.Queue):
        super(Worker, self).__init__()
        self.task_queue = task_queue

    def run(self):
        for (function, *args) in iter(self.task_queue.get, None):
            print(f'Running: {function.__name__}({*args,})')

            # Run the provided function with its parameters in child process
            function(*args)

            self.task_queue.task_done()


def foo(task_queue: multiprocessing.Queue) -> None:
    print('foo')
    # Add new task to queue from this child process
    task_queue.put((bar, 1))


def bar(x: int) -> None:
    print(f'bar: {x}')


def main():
    # Start workers on separate processes
    workers = []
    manager = multiprocessing.Manager()
    task_queue = manager.Queue()
    for i in range(multiprocessing.cpu_count()):
        worker = Worker(task_queue)
        workers.append(worker)
        worker.start()

    # Run foo on child process using the queue as parameter
    task_queue.put((foo, task_queue))

    for _ in workers:
        task_queue.put(None)

    # Block until workers complete and join main process
    for worker in workers:
        worker.join()

    print('Program completed.')


if __name__ == '__main__':
    main()

Expected behavior:

Running: foo((<AutoProxy[Queue] object, typeid 'Queue' at 0x1b963548908>,))
foo
Running: bar((1,))
bar: 1
Program completed.

Actual behavior:

Running: foo((<AutoProxy[Queue] object, typeid 'Queue' at 0x1b963548908>,))
foo
Program completed.

I am very new to multiprocessing, so any help would be greatly appreciated.

As @FrankYellin pointed out, this happens because None is put into task_queue before bar is added: main enqueues one None sentinel per worker immediately after foo, so by the time foo puts (bar, 1) on the queue, the remaining sentinel is ahead of it and the last running worker shuts down before bar is ever retrieved.

Assuming the queue is either non-empty or waiting on a task to complete for the duration of the program (which is true in my case), the join method on the queue can be used. Per the documentation:

Blocks until all items in the queue have been gotten and processed.

The count of unfinished tasks goes up whenever an item is added to the queue. The count goes down whenever a consumer thread calls task_done() to indicate that the item was retrieved and all work on it is complete. When the count of unfinished tasks drops to zero, join() unblocks.
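
For intuition, here is a minimal single-process sketch (my own illustration, not part of the original post) using the standard-library queue.Queue and threading to show the counter behavior described above; the manager queue used in the fix below follows the same task_done()/join() protocol:

import queue
import threading


def consume(q: queue.Queue) -> None:
    # Stop once the None sentinel arrives, mirroring Worker.run above
    for item in iter(q.get, None):
        print(f'processed: {item}')
        q.task_done()  # one task_done() per completed item


def demo() -> None:
    q = queue.Queue()
    t = threading.Thread(target=consume, args=(q,))
    t.start()

    for i in range(3):
        q.put(i)  # each put() increments the unfinished-task count

    q.join()     # unblocks once task_done() has matched every put()
    q.put(None)  # only now is it safe to send the stop sentinel
    t.join()


if __name__ == '__main__':
    demo()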

Here is the updated code:

import multiprocessing


class Worker(multiprocessing.Process):
    def __init__(self, task_queue: multiprocessing.Queue):
        super(Worker, self).__init__()
        self.task_queue = task_queue

    def run(self):
        for (function, *args) in iter(self.task_queue.get, None):
            print(f'Running: {function.__name__}({*args,})')

            # Run the provided function with its parameters in child process
            function(*args)

            self.task_queue.task_done() # <-- Notify queue that task is complete


def foo(task_queue: multiprocessing.Queue) -> None:
    print('foo')
    # Add new task to queue from this child process
    task_queue.put((bar, 1))


def bar(x: int) -> None:
    print(f'bar: {x}')


def main():
    # Start workers on separate processes
    workers = []
    manager = multiprocessing.Manager()
    task_queue = manager.Queue()
    for i in range(multiprocessing.cpu_count()):
        worker = Worker(task_queue)
        workers.append(worker)
        worker.start()

    # Run foo on child process using the queue as parameter
    task_queue.put((foo, task_queue))

    # Block until all items in queue are popped and completed
    task_queue.join() # <---

    for _ in workers:
        task_queue.put(None)

    # Block until workers complete and join main process
    for worker in workers:
        worker.join()

    print('Program completed.')


if __name__ == '__main__':
    main()

This seems to be working fine. I will update this if I find anything new. Thank you all.
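
As an aside, one reason the Manager().Queue() proxy works in this pattern at all: the proxy is picklable, so it can travel inside a task tuple through the queue itself, as in task_queue.put((foo, task_queue)); a plain multiprocessing.Queue can only be shared through inheritance and refuses to be pickled. A small sketch (my own illustration, not from the post):

import multiprocessing
import pickle


if __name__ == '__main__':
    manager = multiprocessing.Manager()

    # The manager queue is a picklable proxy, so it can be sent through
    # the queue itself inside a task tuple
    proxy_queue = manager.Queue()
    pickle.dumps(proxy_queue)  # succeeds

    # A plain multiprocessing.Queue may only be shared by inheritance
    # at process creation, so pickling it as a task argument fails
    raw_queue = multiprocessing.Queue()
    try:
        pickle.dumps(raw_queue)
    except RuntimeError as error:
        print(error)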
