
Using multiprocessing pool of workers

I have written the following code to put my lazy second CPU core to work. What the code basically does is first find the desired "sea" files in the directory hierarchy and then execute a set of external scripts to process these binary "sea" files, producing 50 to 100 text and binary files. As the title of the question suggests, this should happen in a parallel fashion to increase the processing speed.

This question originates from the long discussion that we have been having on the IPython users list, titled "Cannot start ipcluster", which started with my experimentation with IPython's parallel processing functionality.

The issue is that I can't get this code to run correctly. If the folders that contain "sea" files house only "sea" files, the script finishes its execution without fully performing the external script runs. (Say I have 30-50 external scripts to run, but my multiprocessing-enabled script exits after executing only the first script in this external script chain.) Interestingly, if I run this script on an already processed folder (that is, the "sea" files were processed beforehand and the output files are already in that folder), then it runs, and this time I get speed-ups of about 2.4 to 2.7x with respect to the linear processing timings. That is not really expected, since I only have a Core 2 Duo 2.5 GHz CPU in my laptop. Although I have a CUDA-powered GPU, it has nothing to do with my current parallel computing struggle :)

What do you think might be the source of this issue?

Thank you for all comments and suggestions.

#!/usr/bin/env python

from multiprocessing import Pool
from subprocess import call
import os


def find_sea_files():

   file_list, path_list = [], []
   init = os.getcwd()

   for root, dirs, files in os.walk('.'):
      dirs.sort()
      for file in files:
          if file.endswith('.sea'):
              file_list.append(file)
              os.chdir(root)
              path_list.append(os.getcwd())
              os.chdir(init)

   return file_list, path_list


def process_all(pf):
   os.chdir(pf[0])
   call(['postprocessing_saudi', pf[1]])


if __name__ == '__main__':
   pool = Pool(processes=2)              # start 2 worker processes
   files, paths = find_sea_files()
   pathfile = [[paths[i],files[i]] for i in range(len(files))]
   pool.map(process_all, pathfile)

I would start by getting a better feel for what is going on with the worker processes. The multiprocessing module comes with logging for its subprocesses if you need it. Since you have simplified the code to narrow down the problem, I would just debug with a few print statements, like so (or you can pretty-print the pf list):


def process_all(pf):
   print "PID: ", os.getpid()
   print "Script Dir: ", pf[0]
   print "Script: ", pf[1]
   os.chdir(pf[0])
   call(['postprocessing_saudi', pf[1]])


if __name__ == '__main__':
   pool = Pool(processes=2)
   files, paths = find_sea_files()
   pathfile = [[paths[i],files[i]] for i in range(len(files))]
   pool.map(process_all, pathfile, 1) # Ensure the chunk size is 1
   pool.close()
   pool.join()
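
If you want the multiprocessing-level logging mentioned above, here is a minimal sketch of turning it on (this assumes Python 2.6's multiprocessing.log_to_stderr; adjust the log level to taste):

import logging
from multiprocessing import Pool, log_to_stderr

if __name__ == '__main__':
   # Route multiprocessing's internal messages (worker start-up,
   # task dispatch, shutdown) to stderr so the pool's behaviour is visible.
   logger = log_to_stderr()
   logger.setLevel(logging.DEBUG)

   pool = Pool(processes=2)
   files, paths = find_sea_files()
   pathfile = [[paths[i], files[i]] for i in range(len(files))]
   pool.map(process_all, pathfile, 1)
   pool.close()
   pool.join()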

The version of Python that I accomplished this with is 2.6.4.

There are several things I can think of:

1) Have you printed out the pathfile entries? Are you sure that they are all properly generated?

a) I ask because your os.walk is a bit interesting; the dirs.sort() should be OK, but seems quite unnecessary. os.chdir() in general shouldn't be used; the restoration should be alright, but in general you should just be joining root onto init instead of changing directories (see the sketch after this list).

2) I've seen multiprocessing on Python 2.6 have problems spawning subprocesses from pools. (I specifically had a script use multiprocessing to spawn subprocesses; those subprocesses then could not correctly use multiprocessing themselves, and the pool locked up.) Try Python 2.5 with the multiprocessing backport.

3) Try picloud's cloud.mp module (which wraps multiprocessing, but handles pools a tad differently) and see if that works.

You would do:

cloud.mp.join(cloud.mp.map(process_all, pathfile))
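
For point 1a above, here is a minimal sketch of find_sea_files without the os.chdir() calls; it keeps the same return values as the original and just builds the directory paths with os.path.join (treat it as illustrative, not as the poster's exact code):

import os

def find_sea_files():
   # Same interface as the original (file_list, path_list), but the
   # directory paths are built without changing the working directory.
   # dirs.sort() is dropped here; it was harmless but unnecessary.
   file_list, path_list = [], []
   init = os.getcwd()

   for root, dirs, files in os.walk('.'):
      for f in files:
          if f.endswith('.sea'):
              file_list.append(f)
              path_list.append(os.path.normpath(os.path.join(init, root)))

   return file_list, path_list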

(Disclaimer: I am one of the developers of PiCloud)
