SCOOP - How to make workers wait for the root worker before continuing

Question

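I am using SCOOP at work (with Python 3.6, which I cannot update). I need all workers to perform a computation, then wait for the root node to perform a slow computation (the code inside if __name__ == '__main__':), and then perform another computation using the dataframe produced by the root node.

My problem is that SCOOP starts all workers immediately, and they try to run all the code outside if __name__ == '__main__': asynchronously, even when it is below the if block. Since the dataframe does not exist yet, they throw an error.

Which command can force all workers to wait until the root worker has finished a computation before they continue running the rest of the code?

I have experimented with scoop.futures.map, scoop.futures.supply and multiprocessing.managers, without success. I have also tried multiprocessing.Barrier(8).wait(), but it does not work.

There is a scoop.futures.wait(futures) method, but I do not know how to obtain the futures argument...

I have something like this: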

import pandas as pd
import genetic_algorithm
from scoop import futures

df = pd.read_csv('database.csv') # dataframe is too large to be passed to fitness_function for every worker. I want every worker to have a copy of it!

if __name__ == '__main__':
    df = add_new_columns(df) # heavy computation which I just want to perform once (not by all workers)

df = computation_using_new_columns(df) # <--- !!! error - is executed before slow add_new_columns(df) finishes

def fitness_function(): ... # all workers use fitness_function() and an error is thrown if I put it inside the if __name__ == '__main__':

if __name__ == '__main__':
    results = list(futures.map(genetic_algorithm, df))

I run the script with python3 -m scoop script.py, which starts all the workers immediately...

Answer 1

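Each process has its own memory space, so modifying the dataframe in the main process does not affect the workers. You need to pass it to the workers after it has been processed, using some kind of initializer. That does not seem to be available in the SCOOP framework; a more flexible (but slightly more complex) tool is Python's built-in multiprocessing.Pool.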

import pandas as pd
import genetic_algorithm
from multiprocessing import Pool

# add_new_columns and computation_using_new_columns are the asker's own
# functions and are assumed to be defined or imported here.

def fitness_function(): ...

def initializer_func(df_from_parent):
    # Runs once in every worker process when the pool starts.
    global df
    df = computation_using_new_columns(df_from_parent)

if __name__ == '__main__':
    # Read the df in the main process only, since it must be modified
    # before it is sent to the workers.
    df = pd.read_csv('database.csv')

    df = add_new_columns(df)  # slow step, executed once in the main process

    # Pool() creates as many workers as there are CPU cores, passes the df to
    # each of them, and has each worker run computation_using_new_columns on it.
    with Pool(initializer=initializer_func, initargs=(df,)) as pool:
        results = list(pool.imap(genetic_algorithm, df))  # now run your function

If computation_using_new_columns needs to run in every worker, keep it in the initializer. If it only needs to run once, put it after add_new_columns inside the if __name__ == "__main__" block instead.
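For reference, the initializer pattern can be tried without pandas or the real data. The following is only a minimal sketch under stated assumptions: a list of dicts stands in for the dataframe, and the names _shared_table, init_worker, fitness and this toy add_new_columns are hypothetical placeholders, not the asker's actual functions. It shows the parent finishing the slow step before any worker exists, and each worker keeping its own copy in a module-level global.

```python
from multiprocessing import Pool

# Per-process global: every worker receives its own copy through the initializer.
_shared_table = None


def add_new_columns(rows):
    """Hypothetical stand-in for the slow step that should run only once."""
    return [dict(row, double=row["value"] * 2) for row in rows]


def init_worker(table_from_parent):
    """Called once in each worker process when the pool starts."""
    global _shared_table
    _shared_table = table_from_parent


def fitness(index):
    """Task function: reads the worker's copy instead of receiving the table."""
    row = _shared_table[index]
    return row["value"] + row["double"]


def main():
    rows = [{"value": v} for v in range(5)]
    rows = add_new_columns(rows)  # completes before the pool is created
    with Pool(processes=2, initializer=init_worker, initargs=(rows,)) as pool:
        # Only small indices are sent per task; the table is sent once per worker.
        return pool.map(fitness, range(len(rows)))


if __name__ == "__main__":
    print(main())
```

Because the pool is created only after add_new_columns returns, no worker can run ahead of the root process, which is the ordering the question asks for. The table is pickled once per worker rather than once per task.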
