Using Python multiprocessing on a for loop that appends results to a dictionary

Question

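I've gone through the multiprocessing module documentation and the other questions asked here, and none seems to match my case, so I'm starting a new question.

For simplicity, I have a piece of code of the following form: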

from collections import defaultdict

import pandas as pd

# simple dataframe of some users and their properties.
data = {'userId': [1, 2, 3, 4],
        'property': [12, 11, 13, 43]}
df = pd.DataFrame.from_dict(data)

# a function that generates permutations of the above users, in the form of a list of lists
# such as [[1,2,3,4], [2,1,3,4], [2,3,4,1], [2,4,1,3]]
user_perm = generate_permutations(nr_perm=4)

# a function that computes some relation between users
def comp_rel(df, permutation, user_dict):
    # isin() only accepts list-like arguments, so wrap the single ids in lists
    df1 = df.userId.isin([permutation[0]])
    df2 = df.userId.isin([permutation[1]])
    user_dict[permutation[0]] += permutation[1]
    return user_dict


# and finally a loop: 
user_dict = defaultdict(int)
for permutation in user_perm:
    user_dict = comp_rel(df, permutation, user_dict)

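I know this code doesn't make much sense as it stands (if any), but I just wrote a small example that is close to the structure of the actual code I'm working on. user_dict should end up containing userIds and some value for each.

I have the actual code, and it works fine, gives the correct dict and everything, but... it runs on a single thread. It is painfully slow, bearing in mind that I have another 15 threads sitting completely idle.

My question is: how can I use Python's multiprocessing module to change the last for loop so that it runs on all available threads/cores? I looked at the documentation, and it's not very easy to understand.

EDIT: I am trying to use a pool like this: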

p = multiprocessing.Pool(multiprocessing.cpu_count())
p.map(comp_rel(df, permutation, user_dict), user_perm)
p.close()
p.join()

But this breaks, because I am using the line:

user_dict = comp_rel(df, permutation, user_dict)

in the initial code, and I don't know how these dictionaries should be merged once the pool is done.

Answer 1

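Your comp_rel has two parts that need to be separated. The first is the calculation itself, which computes some value for some user ID. The second is the "accumulation" step, which adds that value to the user_dict result.

You can separate out the calculation so that it returns a tuple of (id, value), farm it out with multiprocessing, and then accumulate the results afterwards in the main process: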

from multiprocessing import Pool
from functools import partial
from collections import defaultdict

# We make this a pure function that just returns a result instead of mutating anything
def comp_rel(df, perm):
    ...
    return perm[0], perm[1]

comp_with_df = partial(comp_rel, df) # df is always the same, so factor it out
with Pool(None) as pool: # Pool(None) uses cpu_count automatically
    results = pool.map(comp_with_df, user_perm)

# Now add up the results at the end:
user_dict = defaultdict(int)
for k, v in results:
    user_dict[k] += v

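Alternatively, you could pass a Manager().dict() object directly into the processing function, but that is a bit more complicated and likely won't gain you any additional speed.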
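
For completeness, here is a rough sketch of what the Manager().dict() route could look like. The helper name add_to_shared and the sample user_perm are made up for illustration, and the work is reduced to perm[0], perm[1] as in the code above. Note that a read-modify-write on a dict proxy is two separate calls to the manager process, so this sketch guards it with a manager lock; every update also pays for a round trip to the manager, which is why it is unlikely to be faster.

```python
from functools import partial
from multiprocessing import Manager, Pool


def add_to_shared(shared, lock, perm):
    """Hypothetical worker: add perm[1] to the shared entry for perm[0]."""
    key, value = perm[0], perm[1]
    # get() and the assignment are separate proxy calls, so hold the lock
    # across both to avoid losing updates from other workers.
    with lock:
        shared[key] = shared.get(key, 0) + value


if __name__ == "__main__":
    # Sample permutations, assumed for illustration only.
    user_perm = [[1, 2, 3, 4], [2, 1, 3, 4], [2, 3, 4, 1], [2, 4, 1, 3]]
    with Manager() as manager:
        shared = manager.dict()
        lock = manager.Lock()
        with Pool(None) as pool:
            pool.map(partial(add_to_shared, shared, lock), user_perm)
        # Copy into a plain dict before the manager shuts down.
        user_dict = dict(shared)
    print(sorted(user_dict.items()))  # [(1, 2), (2, 8)]
```

The workers return nothing here; all results travel through the shared dict, which is what makes the lock necessary.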

Based on @Masklinn's suggestion, here is a slightly better way to do it that avoids the memory overhead:

user_dict = defaultdict(int)
with Pool(None) as pool:
    for k, v in pool.imap_unordered(comp_with_df, user_perm):
        user_dict[k] += v

This way we add up the results as they complete, instead of having to store them all in an intermediate list first.
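
If you want something you can run as-is, the following is a minimal self-contained sketch of the same idea. To keep it free of third-party imports, a plain dict called props stands in for df, and the body of comp_rel is a made-up lookup rather than your real computation; accumulate and the sample user_perm are also just illustrative names. The `if __name__ == "__main__":` guard matters on platforms that start workers with spawn (Windows, macOS), where the module is re-imported in every worker.

```python
from collections import defaultdict
from functools import partial
from multiprocessing import Pool


def comp_rel(props, perm):
    """Pure function: returns (user id, value) and mutates nothing."""
    # Hypothetical workload: look up the property of the second user.
    return perm[0], props[perm[1]]


def accumulate(pairs):
    """Sum the values per key from any iterable of (key, value) pairs."""
    user_dict = defaultdict(int)
    for k, v in pairs:
        user_dict[k] += v
    return user_dict


if __name__ == "__main__":
    props = {1: 12, 2: 11, 3: 13, 4: 43}  # stand-in for df
    user_perm = [[1, 2, 3, 4], [2, 1, 3, 4], [2, 3, 4, 1], [2, 4, 1, 3]]
    comp_with_props = partial(comp_rel, props)
    with Pool(None) as pool:
        # Results are consumed as they arrive, inside the with block.
        # chunksize is optional; it batches tasks to cut per-item overhead.
        results = pool.imap_unordered(comp_with_props, user_perm, chunksize=2)
        user_dict = accumulate(results)
    print(sorted(user_dict.items()))  # [(1, 11), (2, 68)]
```

Because addition is commutative, the arrival order of imap_unordered does not affect the totals.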

Answer 2

After a short discussion in the comments, I decided to post a solution using ProcessPoolExecutor:

import concurrent.futures
from collections import defaultdict

def comp_rel(df, perm):
    ...
    return perm[0], perm[1]

user_dict = defaultdict(int)
with concurrent.futures.ProcessPoolExecutor() as executor:
    futures = {executor.submit(comp_rel, df, perm): perm for perm in user_perm}
    for future in concurrent.futures.as_completed(futures):
        try:
            k, v = future.result()
        except Exception as e:
            print(f"{futures[future]} throws {e}")
        else:
            user_dict[k] += v

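It works the same way as @tzaman's solution, but it gives you the possibility of handling exceptions. There are more interesting features in this module; check out the docs.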
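
To make the exception handling concrete, here is a small sketch in which one task fails on purpose. The length check in comp_rel, the collect helper and the sample inputs are all invented for the demonstration; the point is only that a failing permutation is recorded while the remaining results are still summed.

```python
import concurrent.futures
from collections import defaultdict


def comp_rel(perm):
    """Hypothetical worker that rejects permutations with fewer than two users."""
    if len(perm) < 2:
        raise ValueError("permutation needs at least two users")
    return perm[0], perm[1]


def collect(futures):
    """Sum successful results; keep failures keyed by their input permutation."""
    user_dict, errors = defaultdict(int), {}
    for future in concurrent.futures.as_completed(futures):
        try:
            k, v = future.result()  # re-raises the worker's exception here
        except Exception as exc:
            errors[tuple(futures[future])] = exc
        else:
            user_dict[k] += v
    return user_dict, errors


if __name__ == "__main__":
    # [5] is deliberately invalid so that one future raises.
    user_perm = [[1, 2, 3, 4], [2, 1, 3, 4], [5], [2, 4, 1, 3]]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = {executor.submit(comp_rel, perm): perm for perm in user_perm}
        user_dict, errors = collect(futures)
    print(sorted(user_dict.items()))  # [(1, 2), (2, 5)]
    print(errors)
```

Collecting the errors in a dict instead of printing them lets you decide afterwards whether to retry or skip the failed inputs.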
