
Sharing a dictionary in multiprocessing with Python

In my program I need to share a dictionary between processes in multiprocessing with Python. I've simplified the code to put an example here:

import multiprocessing

def folding(return_dict, seq):
    dis = 1
    d = 0
    ddg = 1
    '''This is irrelevant; actually my program sends the seq parameter to an external program that returns the dis, d and ddg parameters'''
    return_dict[seq] = [dis, d, ddg]

seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
manager = multiprocessing.Manager()
return_dict = manager.dict()
n_cores = 3  # the number of cores available in the computer, defined by the user

for i in range(0, len(seqs), n_cores):
    subseqs = seqs[i:i + n_cores]
    processes = [multiprocessing.Process(target=folding, args=(return_dict, seq)) for seq in subseqs]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

for i in return_dict:
    print(i)

I expected to have, at the end of the program, the return_dict with all of the property values. When I run my actual program, which has to do this with a seqs list of thousands of sequences and repeat it many times, sometimes I get the correct result, but most of the time the program never ends yet returns no error, and I have no idea what is going wrong. Furthermore, I think this is not very efficient in time, so I would like to know if there is another way to do this more efficiently and faster. Thank you everybody!

After fixing some minor syntax errors, your code seems to work.

However, I would use a multiprocessing pool instead of your custom solution, so that n_cores processes are always running at a time. The problem with your approach is that all processes in a batch need to finish before you start the next batch. Depending on how variable the time needed to compute folding is, you can encounter a bottleneck. In the worst case, this means no speed-up whatsoever compared to single-core processing.
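To make the bottleneck concrete, here is a minimal sketch (an illustration added here, not part of the original answer) that simulates variable task durations with time.sleep. The task function and the durations list are assumptions chosen for the demonstration: with one slow task in the first batch, the batched approach needs roughly the sum of each batch's slowest task, while a pool lets an idle worker pick up the next task immediately:

import multiprocessing as mp
import time


def task(duration):
    # Stands in for folding: work of variable length,
    # e.g. a call to an external program
    time.sleep(duration)


def main():
    durations = [3, 1, 1, 1, 1, 1]  # one straggler in the first batch

    # Batched approach: each batch of 3 waits for its slowest member
    start = time.time()
    for i in range(0, len(durations), 3):
        procs = [mp.Process(target=task, args=(d,)) for d in durations[i:i + 3]]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
    print('batched: %.1f s' % (time.time() - start))  # ~4 s

    # Pool approach: an idle worker immediately picks up the next task
    start = time.time()
    pool = mp.Pool(3)
    pool.map(task, durations)
    pool.close()
    pool.join()
    print('pool:    %.1f s' % (time.time() - start))  # ~3 s


if __name__ == '__main__':
    main()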

Moreover, your program will run into serious problems under Windows. You need to make sure that your main module can be safely imported without re-running your multiprocessing code. That is, you need to protect your main entry point via if __name__ == '__main__':, which you might have seen already in other Python scripts. This makes sure that the protected code is only executed when your script is started as the main file by the interpreter, i.e. only once, and not by every new child process you spawn.

Here is my slightly changed version of your code using a pool:

import multiprocessing as mp


def folding(return_dict, seq):
    dis = 1
    d = 0
    ddg = 1
    '''This is irrelevant; actually my program sends the seq parameter to an external program that returns the dis, d and ddg parameters'''
    return_dict[seq] = [dis, d, ddg]


def main():
    seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
    manager = mp.Manager()
    return_dict = manager.dict()
    n_cores = 3

    # Create a pool that runs at most n_cores worker processes at a time
    pool = mp.Pool(n_cores)

    # Execute the folding task in parallel
    for seq in seqs:
        pool.apply_async(folding, args=(return_dict, seq))

    # Tell the pool that there are no more tasks to come and join
    pool.close()
    pool.join()

    # Print the results
    for i in return_dict.keys():
        print(i, return_dict[i])


if __name__ == '__main__':
    # Protected main function
    main()

and this will print

atcgtg [1, 0, 1]
atcgatcgatc [1, 0, 1]
agcgatcg [1, 0, 1]
atcggatcg [1, 0, 1]
agctgctagct [1, 0, 1]
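One caveat with apply_async worth noting (an addition, not from the original answer): it returns an AsyncResult, and an exception raised inside the worker only surfaces when you call .get() on that result. The loop above discards these results, so a failing folding call would fail silently. Here is a minimal self-contained sketch, with a hypothetical worker may_fail, showing how to surface such errors:

import multiprocessing as mp


def may_fail(seq):
    # Hypothetical worker used only to demonstrate error propagation
    if not seq:
        raise ValueError('empty sequence')
    return len(seq)


def main():
    pool = mp.Pool(3)
    results = [pool.apply_async(may_fail, args=(seq,)) for seq in ['atcg', '']]
    pool.close()
    pool.join()
    for r in results:
        try:
            r.get()  # re-raises any exception raised in the worker
        except ValueError as e:
            print('worker failed:', e)


if __name__ == '__main__':
    main()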

Example without Shared Data

EDIT: Also, in your case there is actually no need to have a shared data structure. You can simply rely on the pool's map function. map takes an iterable whose elements are then used to call the function folding, once per element. Using map over map_async has the advantage that the results are in the order of the inputs. But you need to wait until all results are gathered before you can process them.

Here is an example using map. Note that your function folding now returns the result instead of putting it into a shared dictionary:

import multiprocessing as mp


def folding(seq):
    dis = 1
    d = 0
    ddg = 1
    '''This is irrelevant; actually my program sends the seq parameter to an external program that returns the dis, d and ddg parameters'''
    # Return results instead of using shared data
    return [dis, d, ddg]


def main():
    seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
    n_cores = 3

    pool = mp.Pool(n_cores)

    # Use map which blocks until all results are ready:
    res = pool.map(folding, iterable=seqs)

    pool.close()
    pool.join()

    # Print the results
    # Using list(zip(..)) to print inputs next to outputs
    print(list(zip(seqs, res)))


if __name__ == '__main__':
    main()

and this one prints

[('atcgtg', [1, 0, 1]), ('agcgatcg', [1, 0, 1]), ('atcgatcgatc', [1, 0, 1]), ('atcggatcg', [1, 0, 1]), ('agctgctagct', [1, 0, 1])]
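If the order of the results does not matter and you want to process each result as soon as it is ready instead of waiting for all of them, imap_unordered is a possible alternative to map. A minimal sketch (an addition, assuming the same placeholder folding logic as above); note that folding here also returns its input, since with imap_unordered the output order matches completion order, not input order:

import multiprocessing as mp


def folding(seq):
    dis, d, ddg = 1, 0, 1  # placeholder, as in the examples above
    # Return the input alongside the result, because the output
    # order is not guaranteed to match the input order
    return seq, [dis, d, ddg]


def main():
    seqs = ['atcgtg', 'agcgatcg', 'atcgatcgatc', 'atcggatcg', 'agctgctagct']
    pool = mp.Pool(3)
    for seq, res in pool.imap_unordered(folding, seqs):
        print(seq, res)  # processed as soon as each worker finishes
    pool.close()
    pool.join()


if __name__ == '__main__':
    main()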
