
python multiprocessing.pool.map, passing arguments to spawned processes

import multiprocessing
import pickle

def content_generator(applications, dict):
    # yield one (application name, application info) pair at a time
    for app in applications:
        yield (app, dict[app])

with open('abc.pickle', 'rb') as f:
    very_large_dict = pickle.load(f)
all_applications = set(very_large_dict.keys())

pool = multiprocessing.Pool()
for result in pool.imap_unordered(func_process_application, content_generator(all_applications, very_large_dict)):
    # do some aggregation on result
    pass

I have a really large dictionary whose keys are strings (application names) and whose values are information about each application. Since the applications are independent, I want to use multiprocessing to process them in parallel. Parallelization works when the dictionary is not that big, but when it is too big all the Python processes get killed. I used dmesg to check what went wrong and found they were killed because the machine ran out of memory. I ran top while the pool processes were running and found that they all occupy the same amount of resident memory (RES), 3.4G each. This confuses me, since it looks as if the whole dictionary was copied into each spawned process. I thought I had broken the dictionary up and was passing only what is relevant to each spawned process by yielding only dict[app] instead of dict. Any thoughts on what I did wrong?

The comments are becoming impossible to follow, so I'm pasting in my important comment here:

On a Linux-y system, new processes are created by fork(), so they get a copy of the entire parent-process address space at the time they're created. It's "copy on write", so it's more of a "virtual" copy than a "real" copy, but still ... ;-) For a start, try creating your Pool before creating giant data structures. Then the child processes will inherit a much smaller address space.

Then some answers to questions:

so in python 2.7, there is no way to spawn a new process?

On Linux-y systems, no. The ability to use "spawn" on those was first added in Python 3.4. On Windows systems, "spawn" has always been the only choice (no fork() on Windows).
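
For reference only (not part of the original answer, and not applicable to Python 2.7): on Python 3.4+ the "spawn" start method can be requested explicitly. A minimal sketch, with a made-up placeholder worker:

# Python 3.4+ only: select the "spawn" start method explicitly.
# Spawned workers start from a fresh interpreter, so they do not
# inherit the parent's data structures the way fork()ed workers do.
import multiprocessing

def work(item):
    return item * 2  # placeholder worker

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    pool = multiprocessing.Pool()
    print(pool.map(work, range(10)))
    pool.close()
    pool.join()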

The big dictionary is passed to a function as an argument, and I could only create the pool inside this function. How would I be able to create the pool before the big dictionary?

As simple as this: make these two lines the first two lines in your program:

import multiprocessing
pool = multiprocessing.Pool()

You can create the pool any time you like (just so long as it exists sometime before you actually use it), and worker processes will inherit the entire address space at the time the Pool constructor is invoked.
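
Put concretely, one possible rearrangement of the question's own code under that advice (a sketch only: the ordering is what changes, the worker body is a placeholder, and both functions are defined before the Pool so the forked workers can resolve them):

import multiprocessing
import pickle

# Worker and generator are defined before the Pool is created, so the
# forked workers already have them; the worker body is a placeholder.
def func_process_application(item):
    app, info = item
    # ... real per-application processing would go here ...
    return app

def content_generator(applications, dict):
    for app in applications:
        yield (app, dict[app])

# Create the pool BEFORE the giant dict exists, so the forked
# workers inherit a small address space.
pool = multiprocessing.Pool()

with open('abc.pickle', 'rb') as f:
    very_large_dict = pickle.load(f)
all_applications = set(very_large_dict.keys())

for result in pool.imap_unordered(func_process_application,
                                  content_generator(all_applications, very_large_dict)):
    pass  # aggregate result here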

ANOTHER SUGGESTION

If you're not mutating the dict after it's created, try using this instead:

def content_generator(dict):
    for app in dict:
        yield app, dict[app]

That way you don't have to materialize a giant set of the keys either. Or, even better (if possible), skip all that and iterate directly over the items:

for result in pool.imap_unordered(func_process_application, very_large_dict.iteritems()):
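
(Side note, not from the original answer: iteritems() exists only in Python 2. Under Python 3 the equivalent lazy iteration is the items() view.)

# Python 3 spelling of the same idea: dict.items() is already a lazy view.
for result in pool.imap_unordered(func_process_application, very_large_dict.items()):
    pass  # aggregate result here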
