
SystemError: NULL result Multiprocessing Python

I am using a multiprocessing pool to train machine learners.

Each LearnerRun object gets a learner, a dictionary of hyperparameters, a name, some further options in another options dictionary, the name of a directory to write results to, a set of IDs of examples to train on (a slice or numpy array), and a set of IDs of examples to test on (also a slice or numpy array). Importantly, the training and testing data are not read yet: the sets of IDs are relatively small and direct a later function's database-reading behavior.

I call self.pool.apply_async(learner_run.run), which formerly worked fine. Now the pool seems to be loaded up, but a print statement at the top of the run() function is never printed, so the processes are not actually getting run.

I've tracked down some other threads about this and found that I can see the problem in more detail with handler = self.pool.apply_async(learner_run.run) followed by handler.get(). This prints "SystemError: NULL result without error in PyObject_Call".
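For context, this is the general pattern for surfacing worker errors: apply_async returns immediately and records any exception from the worker, and only a later get() re-raises it in the parent. A minimal sketch (the work function is a toy stand-in for learner_run.run, not the actual code):

```python
from multiprocessing import Pool

def work():
    # Stand-in for learner_run.run; raises so we can see where errors surface.
    raise ValueError("boom")

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        handler = pool.apply_async(work)  # returns at once; any error is hidden
        try:
            handler.get(timeout=30)  # the worker's exception re-raises here
        except ValueError as exc:
            print("caught from worker:", exc)
```

Without the get() call, the exception is silently swallowed, which is why the missing print statement was the only visible symptom.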

Great, something I can Google. But all I can find on this issue with multiprocessing is that it can be caused by passing arguments that are too big to pickle to the subprocess. But I am obviously passing no arguments to my subprocess. So what gives?

What else, aside from arguments exceeding the allotted memory size (which I am reasonably sure is not the problem here), can cause apply_async to give a NULL result?

Again, this worked before I left for vacation and hasn't been changed. What kinds of changes to other code might cause it to stop working?

(image: memory usage graph)

If I do not try to get() from the handler, so that execution doesn't stop on errors, memory usage follows this strange pattern.

Okay, I found the problem. In fact, my LearnerRun was too large for multiprocessing to handle. But the way in which it was too large is pretty subtle, so I'll describe it.

Evidently it is not just the arguments that need to be pickled; the function is pickled too, including the LearnerRun object its execution relies on (the self).
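This is easy to verify with a small sketch. The Big class and its payload below are hypothetical stand-ins for LearnerRun and its database references; the point is that pickling a bound method drags the entire instance along with it.

```python
import pickle

class Small:
    def run(self):
        return "ok"

class Big:
    def __init__(self):
        # Stand-in for a large in-memory database held by the instance.
        self.payload = b"x" * 10_000_000
    def run(self):
        return "ok"

# Pickling a bound method serializes the instance it is bound to,
# so the pickle grows with everything the object references.
small_bytes = len(pickle.dumps(Small().run))
big_bytes = len(pickle.dumps(Big().run))
print(small_bytes, big_bytes)  # big_bytes is roughly 10 MB larger
```

This is exactly what happens when a bound method like learner_run.run is handed to apply_async: the zero-argument call still carries self across the process boundary.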

LearnerRun's constructor takes everything in the options dictionary passed to it and uses setattr to turn all the keys and values into member variables. This alone is fine, but my coworker realized that this left a couple of strings that needed to be database references, and set self.trainDatabase = LarData(self.trainDatabase) and self.coverageDatabase = LarData(self.coverageDatabase), which ordinarily would be fine.
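A minimal sketch of that constructor pattern (the class name and option keys here are illustrative, not the actual code):

```python
class LearnerRunSketch:
    """Hypothetical sketch of a constructor that copies an options dict
    onto the instance with setattr."""
    def __init__(self, options):
        # Every key/value in the options dict becomes a member variable.
        for key, value in options.items():
            setattr(self, key, value)

run = LearnerRunSketch({"trainDatabase": "train.db", "coverageDatabase": "coverage.db"})
print(run.trainDatabase)
```

The pattern itself is harmless; the trouble starts only when one of those attributes is later replaced with a reference to a huge object.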

Except this means that to pickle the class you have to pickle the entirety of the databases! I discovered this during a sanity check in which I just serialized the LearnerRun itself with pickle.dumps(learner_run) to see what would happen. My memory was flooded, and swap began filling up alarmingly quickly until everything fell over.
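A safer form of that sanity check is to measure the pickled size against a budget before handing an object (or its bound methods) to a pool. This helper is a hypothetical sketch, not part of the original code:

```python
import io
import pickle

def pickled_size(obj, limit=100 * 1024 * 1024):
    """Serialize obj in memory and return the byte count; raise if it
    exceeds limit. A cheap pre-flight check before pool.apply_async.
    (For objects suspected of being truly huge, stream to a temp file
    on disk instead so a failed check can't flood RAM.)"""
    buf = io.BytesIO()
    pickle.dump(obj, buf)
    size = buf.tell()
    if size > limit:
        raise ValueError(f"pickle is {size} bytes, over the {limit}-byte limit")
    return size

print(pickled_size([1, 2, 3]))  # a few dozen bytes
```

Had a check like this been in place, the LarData references would have tripped it long before the pool produced an opaque SystemError.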

So what about pickling to disk? pickle.dump(learner_run, open(filename, "wb")) also blew up. It got to 14.3 GiB before I terminated it!

What about removing those references and calling the LarData constructor later, when needed? Bam. Fixed. Everything works. Multiprocessing doesn't give a mysterious SystemError anymore.
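The shape of the fix can be sketched like this (LarDataStub and the attribute names are placeholders for the real classes): keep only the database names on the instance, and construct the heavy wrappers inside run(), after the object has already been pickled into the worker process.

```python
class LarDataStub:
    """Hypothetical stand-in for the real LarData database wrapper."""
    def __init__(self, name):
        self.name = name

class LearnerRunSketch:
    """Sketch of the fix: the instance stays small and cheap to pickle,
    because it stores names instead of live database objects."""
    def __init__(self, train_db_name, coverage_db_name):
        self.train_db_name = train_db_name
        self.coverage_db_name = coverage_db_name

    def run(self):
        # Heavy objects are built lazily, inside the worker process,
        # so they never travel through pickle.
        train_db = LarDataStub(self.train_db_name)
        coverage_db = LarDataStub(self.coverage_db_name)
        return train_db.name, coverage_db.name
```

With this layout, pickling the bound run method only carries the two strings across the process boundary.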

This is the second time pickle has caused me major pain recently.
