
Using python multiprocessing on a for loop that appends results to dictionary

So I've looked at both the documentation of the multiprocessing module and the other questions asked here, and none seem similar to my case, hence this new question.

For simplicity, I have a piece of code of the form:

import pandas as pd
from collections import defaultdict

# simple dataframe of some users and their properties.
data = {'userId': [1, 2, 3, 4],
        'property': [12, 11, 13, 43]}
df = pd.DataFrame.from_dict(data)

# a function that generates permutations of the above users, in the form of a list of lists
# such as [[1,2,3,4], [2,1,3,4], [2,3,4,1], [2,4,1,3]]
user_perm = generate_permutations(nr_perm=4)

# a function that computes some relation between users
def comp_rel(df, permutation, user_dict):
    df1 = df.userId.isin(permutation[0])
    df2 = df.userId.isin(permutation[1])
    user_dict[permutation[0]] += permutation[1]
    return user_dict


# and finally a loop: 
user_dict = defaultdict(int)
for permutation in user_perm:
    user_dict = comp_rel(df, permutation, user_dict)    

I know this code makes very little (if any) sense right now, but I just wrote a small example that is close to the structure of the actual code I am working on. In the end, user_dict should contain userIds and some associated value.

I have the actual code, and it works fine, gives the correct dict and everything, but... it runs on a single thread. And it's painfully slow, keeping in mind that I have another 15 threads sitting completely idle.

My question is: how can I use Python's multiprocessing module to change the last for loop so that it runs on all available threads/cores? I looked at the documentation, but it's not very easy to understand.

EDIT: I am trying to use Pool as:

p = multiprocessing.Pool(multiprocessing.cpu_count())
p.map(comp_rel(df, permutation, user_dict), user_perm)
p.close()
p.join()

However, this breaks because I am using the line:

user_dict = comp_rel(df, permutation, user_dict) 

in the initial code, and I don't know how these dictionaries should be merged after the pool is done.

There are two parts to your comp_rel which need to be separated. The first is the calculation itself, which computes some value for some userID. The second is the "accumulation" step, which adds that value to the user_dict result.

You can separate out the calculation so that it returns a tuple of (id, value), farm it out with multiprocessing, and then accumulate the results afterwards in the main thread:

from multiprocessing import Pool
from functools import partial
from collections import defaultdict

# We make this a pure function that just returns a result instead of mutating anything
def comp_rel(df, perm):
    ...
    return perm[0], perm[1]

comp_with_df = partial(comp_rel, df) # df is always the same, so factor it out
with Pool(None) as pool: # Pool(None) uses cpu_count automatically
    results = pool.map(comp_with_df, user_perm)

# Now add up the results at the end:
user_dict = defaultdict(int)
for k, v in results:
    user_dict[k] += v

Alternatively, you could also pass a Manager().dict() object into the processing function directly, but that's a little more complicated and likely won't get you any additional speed.
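
For illustration, here is a minimal sketch of that alternative (the comp_rel_shared variant and the explicit lock are my assumptions, not code from the question). Every read-modify-write on the proxy dict has to be guarded, because get-then-set is not atomic across processes:

from multiprocessing import Pool, Manager
from functools import partial

# Hypothetical variant of comp_rel that writes straight into a shared dict.
def comp_rel_shared(df, shared_dict, lock, perm):
    ...
    with lock:  # get-then-set on a managed dict is not atomic
        shared_dict[perm[0]] = shared_dict.get(perm[0], 0) + perm[1]

with Manager() as manager:
    shared_dict = manager.dict()
    lock = manager.Lock()
    with Pool(None) as pool:
        pool.map(partial(comp_rel_shared, df, shared_dict, lock), user_perm)
    user_dict = dict(shared_dict)  # copy out before the manager shuts down

Since every update is a round-trip to the manager process and the lock serializes the writes, this tends to be slower than accumulating in the main process.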


Based on @Masklinn's suggestion, here's a slightly better way to do it that avoids memory overhead:

user_dict = defaultdict(int)
with Pool(None) as pool:
    for k, v in pool.imap_unordered(comp_with_df, user_perm):
        user_dict[k] += v

This way we add up the results as they complete, instead of having to store them all in an intermediate list first.
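
If each comp_rel call is cheap relative to the inter-process communication, the chunksize argument of imap_unordered can also help (a sketch; 64 is just an arbitrary starting value to tune for your workload):

user_dict = defaultdict(int)
with Pool(None) as pool:
    # chunksize ships the permutations to workers in batches, cutting
    # the per-item pickling/IPC overhead.
    for k, v in pool.imap_unordered(comp_with_df, user_perm, chunksize=64):
        user_dict[k] += v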

After a short discussion in the comments, I've decided to post a solution using ProcessPoolExecutor:

import concurrent.futures
from collections import defaultdict

def comp_rel(df, perm):
    ...
    return perm[0], perm[1]

user_dict = defaultdict(int)
with concurrent.futures.ProcessPoolExecutor() as executor:
    futures = {executor.submit(comp_rel, df, perm): perm for perm in user_perm}
    for future in concurrent.futures.as_completed(futures):
        try:
            k, v = future.result()
        except Exception as e:
            print(f"{futures[future]} throws {e}")
        else:
            user_dict[k] += v

It works the same as @tzaman's answer, but gives you the possibility to handle exceptions. There are also more interesting features in this module; check the docs.
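
For instance, if you don't need per-task exception handling, executor.map is a more compact variant (a sketch reusing the names from the snippets above; chunksize is supported for ProcessPoolExecutor.map since Python 3.5):

import concurrent.futures
from collections import defaultdict
from functools import partial

user_dict = defaultdict(int)
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Results come back in input order; a worker exception is re-raised
    # here as soon as its result is iterated over.
    for k, v in executor.map(partial(comp_rel, df), user_perm, chunksize=64):
        user_dict[k] += v

The trade-off versus as_completed is that a single failing permutation aborts the whole loop instead of being handled individually.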
