Python multiprocessing does not run in parallel

Question

I have been trying to use Python's multiprocessing module to parallelize a computationally expensive task.

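My code runs, but it does not run in parallel. I have been reading the multiprocessing documentation and forum posts to find out why, and I still have not figured it out.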

I think the problem may be related to some kind of locking that happens when the other modules I wrote and import are executed.

Here is my code:

main.py:

import time

##import my modules
import prepare_data
import filter_part
import wrapper_part
import utils
from myClasses import ML_set
from myClasses import data_instance

n_proc = 5

def main():
    if __name__ == '__main__':
        ##only main process should run this
        data = prepare_data.import_data() ##read data from file  
        data = prepare_data.remove_and_correct_outliers(data)
        data = prepare_data.normalize_data_range(data)
        features = filter_part.filter_features(data)

        start_t = time.time()
        ##parallelism will be used on this part
        best_subset = wrapper_part.wrapper(n_proc, data, features)

        print time.time() - start_t


main()

wrapper_part.py:

import multiprocessing as mp

##my modules
from myClasses import ML_set
from myClasses import data_instance
import utils

def wrapper(n_proc, data, features):

    p_work_list = utils.divide_features(n_proc-1, features)
    n_train, n_test = utils.divide_data(data)

    workers = []

    for i in range(0,n_proc-1):
        print "sending process:", i
        p = mp.Process(target=worker_classification, args=(i, p_work_list[i], data, features, n_train, n_test))
        workers.append(p)
        p.start()

    for worker in workers:
        print "waiting for join from worker"
        worker.join()


    return


def worker_classification(id, work_list, data, features, n_train, n_test):
    print "Worker ", id, " starting..."
    best_acc = 0
    best_subset = []
    while work_list:
        ##take the next candidate subset off the front of the list
        test_subset = work_list.pop(0)
        train_set, test_set = utils.cut_dataset(n_train, n_test, data, test_subset)
        _, acc = classification_decision_tree(train_set, test_set)
        if acc > best_acc:
            best_acc = acc
            best_subset = test_subset
    print id, " found best subset ->  ", best_subset, " with accuracy: ", best_acc

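None of the other modules use the multiprocessing module, and they all work fine. At this stage I am only testing the parallelism and am not even trying to return results, so there is no communication between the processes and no shared-memory variables. Each process uses some variables, but as far as I know they are defined before the processes are spawned, and I believe each process gets its own copy of them.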

With 5 processes (the main process plus 4 workers), I get this output:

importing data from file...
sending process: 0
sending process: 1
Worker  0  starting...
0  found best subset ->   [2313]  with accuracy:  60.41
sending process: 2
Worker  1  starting...
1  found best subset ->   [3055]  with accuracy:  60.75
sending process: 3
Worker  2  starting...
2  found best subset ->   [3977]  with accuracy:  62.8
waiting for join from worker
waiting for join from worker
waiting for join from worker
waiting for join from worker
Worker  3  starting...
3  found best subset ->   [5770]  with accuracy:  60.07
55.4430000782

With 4 worker processes, the parallel part takes about 55 seconds. Testing with only 1 worker process, the execution time is 16 seconds:

importing data from file...
sending process: 0
waiting for join from worker
Worker  0  starting...
0  found best subset ->   [5870]  with accuracy:  63.32
16.4409999847

I am running Python 2.7 on Windows 8.

Edit

I tested my code on Ubuntu and it worked, so I think something is going wrong between Windows 8 and Python. Here is the output on Ubuntu:

importing data from file...
size trainset:  792  size testset:  302
sending process: 0
sending process: 1
Worker  0  starting...
sending process: 2
Worker  1  starting...
sending process: 3
Worker  2  starting...
waiting for join from worker
Worker  3  starting...
2  found best subset ->   [5199]  with accuracy:  60.93
1  found best subset ->   [3198]  with accuracy:  60.93
0  found best subset ->   [1657]  with accuracy:  61.26
waiting for join from worker
waiting for join from worker
waiting for join from worker
3  found best subset ->   [5985]  with accuracy:  62.25
6.1428809166

From now on I will use Ubuntu for testing, but I would still like to know why the code does not work on Windows.

Answer 1

Make sure to read the Windows guidelines in the multiprocessing manual: https://docs.python.org/2/library/multiprocessing.html#windows

In particular, "Safe importing of main module":

Instead one should protect the "entry point" of the program by using if __name__ == '__main__':

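You violate this rule in the first code snippet shown above, so I did not look any further than that. Hopefully the solution to the problem you observe is as simple as adding this protection.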
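As a rough sketch of what that protection can look like, the example below keeps the guard at module level, so importing the file only defines functions and starts nothing. It uses Python 3 syntax, and `square_sum`, `worker` and the toy data are hypothetical stand-ins, not the code from the question.

```python
import multiprocessing as mp
import time


def square_sum(numbers):
    """Toy CPU-bound job (hypothetical stand-in for the real classification work)."""
    return sum(n * n for n in numbers)


def worker(worker_id, numbers):
    """Target run in each child process; it only prints its own result."""
    print("Worker", worker_id, "->", square_sum(numbers))


def main():
    # Two made-up chunks of work, one per child process.
    chunks = [list(range(0, 1000)), list(range(1000, 2000))]
    start_t = time.time()
    workers = [mp.Process(target=worker, args=(i, chunk))
               for i, chunk in enumerate(chunks)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    print(time.time() - start_t)


# Module level: a child that re-imports this file sees only the definitions
# above, and the call below is skipped because its __name__ is not "__main__".
if __name__ == "__main__":
    main()
```

Nothing outside the guard does any work at import time, which is the property the documentation asks for.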

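The reason this matters: on Unix-like systems, child processes are created by forking. In that case the operating system creates an exact copy of the process that calls fork. That is, the child inherits all state from the parent. This means, for example, that all functions and classes are already defined.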
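To make that inheritance concrete, here is a small sketch that calls `os.fork` directly rather than going through multiprocessing. The names `state` and `read_state_in_forked_child` are made up for illustration, and on platforms without `os.fork` (Windows, for instance) the function simply returns None.

```python
import os

# State created in the parent before any child exists.
state = {"loaded": True}


def read_state_in_forked_child():
    """Fork, let the child report what it sees in `state`, return that text."""
    if not hasattr(os, "fork"):
        return None  # no fork system call available on this platform
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: `state` is already present, nothing was re-imported or re-defined.
        try:
            os.close(read_fd)
            os.write(write_fd, repr(state).encode())
        finally:
            os._exit(0)  # leave immediately, never fall through into parent code
    # Parent: close the write end so read() sees end-of-file once the child exits.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        seen = reader.read()
    os.waitpid(pid, 0)
    return seen


print(read_state_in_forked_child())
```

On a Unix-like system the child reports the dictionary exactly as the parent built it, without running any module code a second time.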

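On Windows there is no such system call. Python has to do the heavy lifting itself: it creates a fresh Python interpreter session in the child and then re-creates the parent's state, step by step. For example, all functions and classes have to be defined again. That is why heavy import machinery runs under the hood of a Python multiprocessing child on Windows. This machinery starts when the child imports the main module. In your case, that implies a call to main() in the child! Surely you do not want that.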
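The effect of that re-import can be imitated without starting any process. The sketch below is only a simulation: it writes a tiny script to a temporary file and executes it twice with `runpy.run_path`, once under the name `__main__` (as the parent would) and once under `__mp_main__`, which is the name Python 3's spawn implementation gives the re-imported main module (Python 2.7 uses a different name, with the same effect). `SCRIPT` and `run_as` are hypothetical helpers for this illustration.

```python
import os
import runpy
import tempfile

# A miniature "main module" with one unguarded call and one guarded block.
SCRIPT = '''
calls = []

def main():
    calls.append("main ran")

main()  # unguarded: runs every time this file is executed or imported

if __name__ == "__main__":
    calls.append("guarded block ran")
'''


def run_as(run_name):
    """Execute SCRIPT as a module called run_name and return its `calls` list."""
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(SCRIPT)
        return runpy.run_path(path, run_name=run_name)["calls"]
    finally:
        os.remove(path)


as_parent = run_as("__main__")     # how the parent runs the script
as_child = run_as("__mp_main__")   # how a spawned child re-imports it
print(as_parent)  # ['main ran', 'guarded block ran']
print(as_child)   # ['main ran']
```

The unguarded call to main() runs in both cases, while the guarded block runs only under `__main__`.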

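You might find this tedious. I find it impressive that the multiprocessing module manages to offer an interface to the same functionality on two such different platforms. Really, when it comes to process handling, POSIX-compliant operating systems and Windows are so different that it is inherently difficult to come up with an abstraction that works on both.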

Solution 1
2 2015-02-28 22:23:39
