
Python queue concurrency process management

The use case is as follows: I have a script that runs a series of non-Python executables to reduce (pulsar) data. Right now I use subprocess.Popen(..., shell=True) and then subprocess's communicate function to capture the standard output and standard error from the non-Python executables, and I log the captured output using the Python logging module.
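
For reference, a minimal sketch of what this single-process setup might look like; the executable name and its arguments below are placeholders, not part of the original script:

```python
import logging
import subprocess

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("reduction")

def run_step(cmd):
    # Run one non-Python executable, capture stdout/stderr via communicate(),
    # and forward both streams to the logging module.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, text=True)
    out, err = proc.communicate()
    if out:
        log.info(out.strip())
    if err:
        log.warning(err.strip())
    return proc.returncode

# "rfifind" and its arguments are hypothetical stand-ins for one reduction step.
run_step(["rfifind", "-o", "mask", "raw_data.fil"])
```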

The problem is: only one of the 8 available cores gets used most of the time. I want to spawn multiple processes, each doing a part of the data set in parallel, and I want to keep track of progress. It is a script/program to analyze data from a low-frequency radio telescope (LOFAR). The easier it is to install, manage and test, the better. I was about to build code to manage all this, but I'm sure it must already exist in some easy library form.

Perhaps Celery would meet your needs.
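
A minimal sketch of how that could look, assuming a broker such as Redis is running at the given URL; "my_reducer" and the task name are hypothetical:

```python
# tasks.py
import subprocess
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def reduce_chunk(input_file):
    # Run one reduction executable on one chunk of data; return its exit code.
    result = subprocess.run(["my_reducer", input_file],
                            capture_output=True, text=True)
    return result.returncode
```

You would start workers with `celery -A tasks worker` and enqueue work with `reduce_chunk.delay("chunk0.dat")`; Celery then spreads the tasks over worker processes and tracks their state for you.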

The subprocess module can start multiple processes for you just fine, and keep track of them. The problem, though, is reading the output from each process without blocking any other processes. Depending on the platform there are several ways of doing this: using the select module to see which process has data to be read, setting the output pipes non-blocking using the fcntl module, or using threads to read each process's data (which is what subprocess.Popen.communicate itself does on Windows, because it doesn't have the other two options). In each case the devil is in the details, though.
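
As a rough POSIX-only sketch of the select-based option (the command names are hypothetical, and note that readline() can still block briefly if a child writes partial lines):

```python
import selectors
import subprocess

# One child process per chunk of data; "worker_exe" is a placeholder.
cmds = [["worker_exe", "chunk0.dat"], ["worker_exe", "chunk1.dat"]]
procs = [subprocess.Popen(c, stdout=subprocess.PIPE) for c in cmds]

sel = selectors.DefaultSelector()
for p in procs:
    sel.register(p.stdout, selectors.EVENT_READ, data=p)

# Loop until every registered pipe has reached EOF.
while sel.get_map():
    for key, _ in sel.select():
        line = key.fileobj.readline()
        if line:
            print(key.data.pid, line.decode().rstrip())
        else:  # EOF: the child closed its end of the pipe
            sel.unregister(key.fileobj)
            key.fileobj.close()

for p in procs:
    p.wait()
```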

Something that handles all this for you is Twisted, which can spawn as many processes as you want, and can call your callbacks with the data they produce (as well as in other situations).
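
A minimal sketch of that with Twisted's ProcessProtocol; the executable and file names are hypothetical:

```python
from twisted.internet import protocol, reactor

chunks = ["chunk0.dat", "chunk1.dat"]
remaining = [len(chunks)]

class ReductionProtocol(protocol.ProcessProtocol):
    def __init__(self, name):
        self.name = name

    def outReceived(self, data):
        # Called whenever the child writes to stdout.
        print(self.name, "stdout:", data.decode().rstrip())

    def errReceived(self, data):
        print(self.name, "stderr:", data.decode().rstrip())

    def processEnded(self, reason):
        # Stop the reactor once every child has finished.
        remaining[0] -= 1
        if remaining[0] == 0:
            reactor.stop()

for chunk in chunks:
    reactor.spawnProcess(ReductionProtocol(chunk),
                         "my_reducer", ["my_reducer", chunk])
reactor.run()
```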

If I understand correctly what you are doing, I might suggest a slightly different approach. Try establishing a single unit of work as a function and then layering the parallel processing on top of that. For example:

  1. Wrap the current functionality (calling subprocess and capturing output) into a single function. Have the function create a result object that can be returned; alternatively, the function could write out to files as you see fit.
  2. Create an iterable (a list, etc.) that contains an input for each chunk of data for step 1.
  3. Create a multiprocessing Pool and then use its map() functionality to execute your step 1 function for each of the items in step 2. See the Python multiprocessing docs for details, and the sketch after this list.
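
A sketch of those three steps; "my_reducer" and the chunk file names are hypothetical, and imap_unordered is used instead of map so progress can be reported as each chunk finishes:

```python
import subprocess
from multiprocessing import Pool

def process_chunk(input_file):
    # Step 1: one unit of work -- run the executable and capture its output.
    result = subprocess.run(["my_reducer", input_file],
                            capture_output=True, text=True)
    return input_file, result.returncode, result.stdout

if __name__ == "__main__":
    # Step 2: one input per chunk of data.
    chunks = ["chunk%d.dat" % i for i in range(8)]

    # Step 3: a Pool maps the function over the inputs in parallel.
    with Pool(processes=8) as pool:
        for name, rc, _ in pool.imap_unordered(process_chunk, chunks):
            print(name, "finished with exit code", rc)
```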

You could also use a worker/Queue model. The key, I think, is to encapsulate the current subprocess/output-capture code into a function that does the work for a single chunk of data (whatever that is). Layering the parallel processing on top of that is then quite straightforward using any of several techniques, only a couple of which are described here.
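
For completeness, a sketch of the worker/Queue variant built directly on multiprocessing; all names are again hypothetical:

```python
import subprocess
from multiprocessing import Process, Queue

def worker(tasks, results):
    # Pull chunk names off the shared queue until the None sentinel appears.
    for input_file in iter(tasks.get, None):
        proc = subprocess.run(["my_reducer", input_file], capture_output=True)
        results.put((input_file, proc.returncode))

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(8)]
    for w in workers:
        w.start()

    chunks = ["chunk%d.dat" % i for i in range(32)]
    for c in chunks:
        tasks.put(c)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker

    for _ in chunks:
        name, rc = results.get()  # progress: one result per finished chunk
        print(name, "exit code", rc)

    for w in workers:
        w.join()
```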
