How to spawn parallel child processes on a multi-processor system?

Question

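I have a Python script that I want to use as a controller for another Python script. I have a server with 64 processors, so I want to spawn up to 64 child processes of this second Python script. The child script is called: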

$ python create_graphs.py --name=NAME

where NAME is something like XYZ, ABC, NYU, etc.

In my parent controller script I retrieve the name variable from a list:

my_list = [ 'XYZ', 'ABC', 'NYU' ]

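So my question is, what is the best way to spawn off these processes as children? I want to limit the number of children to 64 at a time, so I need to track the status (whether each child process has finished or not) so that I can efficiently keep the whole generation running.

I looked into using the subprocess package, but rejected it because it only spawns one child at a time. I finally found the multiprocessing package, but I admit to being overwhelmed by the whole threads vs. subprocesses documentation.

Right now, my script uses subprocess.call to spawn only one child at a time, and looks like this: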

#!/path/to/python
import subprocess, multiprocessing, Queue
from multiprocessing import Process

my_list = [ 'XYZ', 'ABC', 'NYU' ]

if __name__ == '__main__':
    processors = multiprocessing.cpu_count()

    for i in range(len(my_list)):
        if( i < processors ):
             cmd = ["python", "/path/to/create_graphs.py", "--name="+ my_list[i]]
             child = subprocess.call( cmd, shell=False )

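I really want it to spawn up to 64 children at a time. In other stackoverflow questions I saw people using Queue, but it seems like that creates a performance hit?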

Answer 1

What you are looking for is the process pool class in multiprocessing.

import multiprocessing
import subprocess

def work(cmd):
    return subprocess.call(cmd, shell=False)

if __name__ == '__main__':
    count = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=count)
    print(pool.map(work, ['ls'] * count))
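
Applied to the setup in the question, the same pool can take the list of names directly, and because a pool holds a fixed number of workers, no more than that many children run at once. The following is only a sketch under assumptions: build_cmd is a hypothetical helper, and it launches a small inline Python command as a stand-in for create_graphs.py so that it runs anywhere. It is written for Python 3.

```python
import multiprocessing
import subprocess
import sys

def build_cmd(name):
    # Stand-in for ["python", "/path/to/create_graphs.py", "--name=" + name].
    # The inline script just echoes the argument it receives.
    return [sys.executable, "-c", "import sys; print(sys.argv[1])", "--name=" + name]

def work(name):
    # Runs one child and returns its exit status (0 means success).
    return subprocess.call(build_cmd(name), shell=False)

if __name__ == '__main__':
    my_list = ['XYZ', 'ABC', 'NYU']
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    exit_codes = pool.map(work, my_list)  # blocks until every child has finished
    pool.close()
    pool.join()
    # Pair each name with the exit status of its child
    print(dict(zip(my_list, exit_codes)))
```

pool.map returns the results in the same order as my_list, which is what makes the zip at the end safe.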

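And here is a calculation example to make it easier to understand. The following divides 10000 tasks among N processes, where N is the CPU count. Note that I'm passing None as the number of processes. This causes the Pool class to use cpu_count for the number of processes (see the multiprocessing.Pool documentation).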

import multiprocessing

def calculate(value):
    return value * 10

if __name__ == '__main__':
    pool = multiprocessing.Pool(None)  # None: use cpu_count() worker processes
    tasks = range(10000)
    results = []
    # The callback is called once, with the complete list of results
    r = pool.map_async(calculate, tasks, callback=results.append)
    r.wait() # Wait on the results
    print(results)
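
One detail of the example above: map_async invokes its callback once, passing the whole list of results, so results ends up holding a single nested list. A minimal sketch of reading the values directly from the returned AsyncResult with get(); the helper name run and the task count of 5 are illustrative and not from the original answer, and the context-manager form of Pool assumes Python 3.

```python
import multiprocessing

def calculate(value):
    return value * 10

def run(n=5):
    collected = []
    with multiprocessing.Pool(None) as pool:
        async_result = pool.map_async(calculate, range(n), callback=collected.append)
        values = async_result.get()  # waits for completion and returns the result list
    return values, collected

if __name__ == '__main__':
    values, collected = run()
    print(values)     # [0, 10, 20, 30, 40]
    print(collected)  # [[0, 10, 20, 30, 40]] - the callback received the whole list once
```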

Answer 2

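Here is the solution I came up with, based on Nadia's and Jim's comments. I am not sure if it is the best way, but it works. The original child script being called needs to be a shell script, because I need to use some third-party applications, including Matlab. So I had to take it out of Python and code it in bash.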

import sys
import os
import multiprocessing
import subprocess

def work(staname):
    print('Processing station: %s' % staname)
    print('Parent process: %s' % os.getppid())
    print('Process id: %s' % os.getpid())
    cmd = [ "/bin/bash", "/path/to/executable/create_graphs.sh", "--name=%s" % (staname) ]
    return subprocess.call(cmd, shell=False)

if __name__ == '__main__':

    my_list = [ 'XYZ', 'ABC', 'NYU' ]
    my_list.sort()
    print(my_list)

    # Get the number of processors available
    num_processes = multiprocessing.cpu_count()

    threads = []
    len_stas = len(my_list)
    print("+++ Number of stations to process: %s" % (len_stas))

    # run until all the threads are done, and there is no data left
    for list_item in my_list:

        # if we aren't using all the processors AND there is still data left to
        # compute, then spawn another thread
        if( len(threads) < num_processes ):
            p = multiprocessing.Process(target=work,args=[list_item])
            p.start()
            print('%s %s' % (p, p.is_alive()))
            threads.append(p)
        else:
            # NOTE: when all slots are busy, this branch only reaps finished
            # processes; the current list_item is not started.
            # Iterate over a copy so that removing entries is safe.
            for thread in threads[:]:
                if not thread.is_alive():
                    threads.remove(thread)

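Does this seem like a reasonable solution? I tried to use Jim's while loop format, but my script just returned nothing. I am not sure why that would be. Here is the output when I run the script with Jim's 'while' loop replacing the 'for' loop: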

hostname{me}2% controller.py 
['ABC', 'NYU', 'XYZ']
Number of processes: 64
+++ Number of stations to process: 3
hostname{me}3%

When I run it with the 'for' loop, I get something more meaningful:

hostname{me}6% controller.py 
['ABC', 'NYU', 'XYZ']
Number of processes: 64
+++ Number of stations to process: 3
Processing station: ABC
Parent process: 1056
Process id: 1068
Processing station: NYU
Parent process: 1056
Process id: 1069
Processing station: XYZ
Parent process: 1056
Process id: 1071
hostname{me}7%

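So this works, and I'm happy. However, I still don't get why I can't use Jim's 'while' style loop instead of the 'for' loop I am using. Thanks for all the help - I am impressed with the breadth of knowledge @ stackoverflow.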
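
For comparison, a launcher that caps the number of running children without skipping any item when every slot is busy could look like the sketch below. It swaps multiprocessing.Process for plain subprocess.Popen objects polled with poll(). The function name run_limited, the 0.05 second sleep and the stand-in commands are assumptions for illustration, not part of the solution above.

```python
import subprocess
import sys
import time

def run_limited(commands, max_running):
    """Run every command, keeping at most max_running children alive at once."""
    pending = list(enumerate(commands))
    running = []                        # (index, Popen) pairs
    exit_codes = [None] * len(pending)
    while pending or running:
        # Fill free slots first, so nothing is skipped when the limit is reached.
        while pending and len(running) < max_running:
            index, cmd = pending.pop(0)
            running.append((index, subprocess.Popen(cmd, shell=False)))
        # Reap finished children; poll() returns None while a child is still running.
        still_running = []
        for index, child in running:
            code = child.poll()
            if code is None:
                still_running.append((index, child))
            else:
                exit_codes[index] = code
        running = still_running
        time.sleep(0.05)                # avoid a busy loop
    return exit_codes

if __name__ == '__main__':
    # Stand-in commands; the real one would be
    # ["/bin/bash", "/path/to/executable/create_graphs.sh", "--name=" + name]
    commands = [[sys.executable, "-c", "print(%r)" % name]
                for name in ['ABC', 'NYU', 'XYZ']]
    print(run_limited(commands, max_running=2))
```

The returned list holds one exit status per command, in the order the commands were given.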

Answer 3

I would definitely use multiprocessing rather than rolling my own solution using subprocess.

Answer 4

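I don't think you need a queue unless you intend to get data out of the applications (and if you do want data, I think it may be easier to add it to a database anyway).

But try this on for size:

Put the contents of your create_graphs.py script all into a function called "create_graphs".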

import threading
from create_graphs import create_graphs

num_processes = 64
my_list = [ 'XYZ', 'ABC', 'NYU' ]

threads = []

# run until all the threads are done, and there is no data left
while threads or my_list:

    # if we aren't using all the processors AND there is still data left to
    # compute, then spawn another thread
    if (len(threads) < num_processes) and my_list:
        t = threading.Thread(target=create_graphs, args=[ my_list.pop() ])
        t.daemon = True
        t.start()
        threads.append(t)

    # in the case that we have the maximum number of threads check if any of them
    # are done. (also do this when we run out of data, until all the threads are done)
    else:
        # iterate over a copy so that removing finished threads is safe
        for thread in threads[:]:
            if not thread.is_alive():
                threads.remove(thread)

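I know that this will result in 1 fewer thread than processors, which is probably good: it leaves a processor to manage the threads, disk I/O, and other things happening on the computer. If you decide you want to use the last core, just add one to it.

Edit: I think I may have misinterpreted the purpose of my_list. You do not need my_list to keep track of the threads at all (as they're all referenced by the items in the threads list). But this is a fine way of feeding the processes input - or even better: use a generator function ;)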
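
To illustrate the generator idea, here is a rough sketch of the same loop fed from a generator instead of a list. The names station_names, run_threads and _DONE are hypothetical, and the short sleep is an addition to keep the loop from spinning; none of them are part of the code above.

```python
import threading
import time

_DONE = object()  # sentinel marking an exhausted input source

def station_names():
    # Hypothetical generator standing in for my_list
    for name in ['XYZ', 'ABC', 'NYU']:
        yield name

def run_threads(target, items, max_threads):
    source = iter(items)
    next_item = next(source, _DONE)
    threads = []
    # run until all the threads are done, and there is no data left
    while threads or next_item is not _DONE:
        if len(threads) < max_threads and next_item is not _DONE:
            t = threading.Thread(target=target, args=(next_item,))
            t.daemon = True
            t.start()
            threads.append(t)
            next_item = next(source, _DONE)
        else:
            # keep only the threads that are still running
            threads = [t for t in threads if t.is_alive()]
            time.sleep(0.01)

if __name__ == '__main__':
    seen = []
    run_threads(seen.append, station_names(), max_threads=2)
    print(sorted(seen))
```

The generator is consumed one item at a time, so the input never has to exist as a complete list in memory.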

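The purpose of `my_list` and `threads`

my_list holds the data that you need to process in your function
threads is just a list of the currently running threads

The while loop does two things: it starts new threads to process the data, and it checks whether any threads are done running.

So as long as you have either (a) more data to process, or (b) threads that aren't finished running, you want the program to continue running. Once both lists are empty they evaluate to False and the while loop exits.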
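
The exit condition relies on empty lists being falsy in Python. A tiny sketch that uses the same two variable names purely for illustration:

```python
threads = []
my_list = ['XYZ']

print(bool(threads or my_list))  # True: there is still data to process
my_list.pop()
print(bool(threads or my_list))  # False: both lists are empty, so the loop would stop
```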


4 answers

Answer 1: score 60, accepted, 2009-05-19 20:26:13
Answer 2: score 2, 2009-06-17 16:16:38
Answer 3: score 1, 2009-05-19 20:04:01
Answer 4: score 1, 2009-05-19 20:04:26
