
Python multiprocessing - does the number of processes in a pool decrease on error?

The code:

import multiprocessing
print(f'num cpus {multiprocessing.cpu_count():d}')
import sys; print(f'Python {sys.version} on {sys.platform}')

def _process(m):
    print(m) #; return m
    raise ValueError(m)

args_list = [[i] for i in range(1, 20)]

if __name__ == '__main__':
    with multiprocessing.Pool(2) as p:
        print([r for r in p.starmap(_process, args_list)])

prints:

num cpus 8
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28) 
[Clang 6.0 (clang-600.0.57)] on darwin
1
7
4
10
13
16
19
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/Users/ubik-mac13/Library/Preferences/PyCharm2018.3/scratches/multiprocess_error.py", line 8, in _process
    raise ValueError(m)
ValueError: 1
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/ubik-mac13/Library/Preferences/PyCharm2018.3/scratches/multiprocess_error.py", line 18, in <module>
    print([r for r in p.starmap(_process, args_list)])
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 298, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 683, in get
    raise self._value
ValueError: 1

Process finished with exit code 1

Increasing the number of processes in the pool to 3 or 4 prints all the odd numbers (possibly out of order):

1
3
5
9
11
7
13
15
17
19

while with 5 or more it prints the whole range 1-19. So what happens here? Do the processes crash after a number of failures?

This is a toy example, of course, but it comes from a real code issue I had: after leaving a multiprocessing pool running for some days, CPU use steadily went down, as if some processes had been killed (note the CPU utilization going downhill on 03/04 and 03/06 while there were still lots of tasks to run):

[image: CPU utilization chart]

When the code terminated it presented me with one (and only one, as here, although there were many more processes) multiprocessing.pool.RemoteTraceback - bonus question: is this traceback random? In this toy example it is usually ValueError: 1, but sometimes other numbers also appear. Does multiprocessing keep the first traceback from the first process that crashes?

A quick experiment with watch ps aux in one window and your code in the other seems to say that no, exceptions don't crash the child processes.
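
You can reproduce that check in pure Python by having the workers report their PIDs and comparing them with the pool's worker processes afterwards. A minimal sketch, which peeks at the private _pool attribute purely for illustration:

import multiprocessing
import os

def _process(m):
    # report which worker handles each input, then fail
    print(f'pid {os.getpid()} got {m}')
    raise ValueError(m)

if __name__ == '__main__':
    with multiprocessing.Pool(2) as p:
        pids_before = [w.pid for w in p._pool]  # private attribute, demo only
        try:
            p.map(_process, range(1, 20), chunksize=1)
        except ValueError:
            pass
        # the same two workers are still alive despite 19 exceptions
        print('before:', pids_before, 'after:', [w.pid for w in p._pool])

Only two distinct PIDs ever appear in the output, and they are the same before and after all the failures.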

The MapResult object that underlies map/starmap operations only collects the first exception, and considers the entire map job a failure if any job fails with an exception.

(How many jobs are sent to each worker at a time is governed by the chunksize parameter to .map() and friends.)
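
You can see the chunking effect directly by forcing chunksize=1 in the toy example: even with only two workers, every number from 1 to 19 then prints before starmap re-raises the first collected exception. A sketch of the same toy code with the parameter made explicit:

import multiprocessing

def _process(m):
    print(m)
    raise ValueError(m)

if __name__ == '__main__':
    args_list = [[i] for i in range(1, 20)]
    with multiprocessing.Pool(2) as p:
        # with chunksize=1 every taskel is its own task, so the workers keep
        # pulling and executing tasks even after earlier ones have failed;
        # starmap still re-raises the first collected exception at the end
        p.starmap(_process, args_list, chunksize=1)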

If you want something that's more resilient to exceptions, you could just use .apply_async():

import multiprocessing

def _process(m):
    if m % 2 == 0:
        raise ValueError('I only work on odd numbers')
    return m * 8


if __name__ == '__main__':
    args_list = list(range(1, 20))
    with multiprocessing.Pool(2) as p:
        params_and_jobs = [((arg,), p.apply_async(_process, (arg,))) for arg in args_list]
        for params, job in params_and_jobs:
            job.wait()
            # regularly you'd use `job.get()`, but it would `raise` the exception,
            # which is not suitable for this example, so we dig in deeper and just use
            # the `._value` it'd return or raise:
            print(params, type(job._value), job._value)

outputs

(1,) <class 'int'> 8
(2,) <class 'ValueError'> I only work on odd numbers
(3,) <class 'int'> 24
(4,) <class 'ValueError'> I only work on odd numbers
(5,) <class 'int'> 40
(6,) <class 'ValueError'> I only work on odd numbers
(7,) <class 'int'> 56
(8,) <class 'ValueError'> I only work on odd numbers
(9,) <class 'int'> 72
(10,) <class 'ValueError'> I only work on odd numbers
(11,) <class 'int'> 88
(12,) <class 'ValueError'> I only work on odd numbers
(13,) <class 'int'> 104
(14,) <class 'ValueError'> I only work on odd numbers
(15,) <class 'int'> 120
(16,) <class 'ValueError'> I only work on odd numbers
(17,) <class 'int'> 136
(18,) <class 'ValueError'> I only work on odd numbers
(19,) <class 'int'> 152
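
In real code you would usually call job.get() and catch the exception per task instead of reaching into the private ._value; a sketch of that variant:

import multiprocessing

def _process(m):
    if m % 2 == 0:
        raise ValueError('I only work on odd numbers')
    return m * 8

if __name__ == '__main__':
    with multiprocessing.Pool(2) as p:
        jobs = [(arg, p.apply_async(_process, (arg,))) for arg in range(1, 20)]
        for arg, job in jobs:
            try:
                # get() re-raises the worker's exception in the parent
                print(arg, job.get())
            except ValueError as exc:
                print(arg, 'failed:', exc)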

No, only a whole task blows up, not the process itself. The behavior you observed in your toy example is explained by the chunksizes that result from the combination of the number of workers and the length of the iterable. If you grab the function calc_chunksize_info from here, you can see the difference in the resulting chunksizes:

calc_chunksize_info(n_workers=2, len_iterable=20)
# Chunkinfo(n_workers=2, len_iterable=20, n_chunks=7, chunksize=3, last_chunk=2)

calc_chunksize_info(n_workers=5, len_iterable=20)
# Chunkinfo(n_workers=5, len_iterable=20, n_chunks=20, chunksize=1, last_chunk=1) 
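
calc_chunksize_info comes from the answer linked above; the chunksize arithmetic itself mirrors the divmod calculation inside CPython's Pool._map_async, which a short sketch can reconstruct (the real helper in the linked answer may differ in detail):

from collections import namedtuple
import math

Chunkinfo = namedtuple(
    'Chunkinfo', 'n_workers len_iterable n_chunks chunksize last_chunk')

def calc_chunksize_info(n_workers, len_iterable, factor=4):
    # mirrors Pool._map_async: divide by n_workers * 4, rounding up
    chunksize, extra = divmod(len_iterable, n_workers * factor)
    if extra:
        chunksize += 1
    n_chunks = math.ceil(len_iterable / chunksize)
    last_chunk = len_iterable % chunksize or chunksize
    return Chunkinfo(n_workers, len_iterable, n_chunks, chunksize, last_chunk)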

In case the chunksize is > 1, all untouched "taskels" (1. Definitions: Taskel) within a task are also lost as soon as the first taskel raises an exception. Handle expectable exceptions directly within your target function, or write an additional wrapper for error handling to prevent that.
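
Such a wrapper can simply catch the exception and return it as a value, so a failing taskel can no longer take the rest of its chunk down with it. A minimal sketch (_safe_process is a hypothetical name):

import multiprocessing

def _process(m):
    if m % 2 == 0:
        raise ValueError('I only work on odd numbers')
    return m * 8

def _safe_process(m):
    # hypothetical wrapper: turn exceptions into return values so a failing
    # taskel cannot abort the remaining taskels of its chunk
    try:
        return _process(m)
    except ValueError as exc:
        return exc

if __name__ == '__main__':
    with multiprocessing.Pool(2) as p:
        for result in p.map(_safe_process, range(1, 20)):
            print(type(result).__name__, result)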

When the code terminated it presented me with one (and only one, as here, although there were many more processes) multiprocessing.pool.RemoteTraceback - bonus question: is this traceback random? In this toy example it is usually ValueError: 1, but sometimes other numbers also appear. Does multiprocessing keep the first traceback from the first process that crashes?

The worker processes get tasks from a shared queue. Reading from the queue is sequential, so task 1 will always be read before task 2. It's not predictable in which order the results will be ready in the workers, though. There are a lot of hardware- and OS-dependent factors in play, so yes, the traceback is random insofar as the order of results is random, since the (stringified) traceback is part of the result being sent back to the parent. The results are also sent back over a shared queue, and Pool internally handles returning tasks just in time. In case a task returns unsuccessfully, the whole job is marked as not successful and further arriving tasks are discarded. Only the first retrieved exception gets re-raised in the parent, as soon as all tasks within the job have returned.
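
One way to observe this from the parent is the error_callback parameter of starmap_async/map_async, which is called exactly once with that first collected exception after the whole job has finished. A sketch reusing the toy _process; which number it reports varies between runs:

import multiprocessing

def _process(m):
    raise ValueError(m)

if __name__ == '__main__':
    args_list = [[i] for i in range(1, 20)]
    with multiprocessing.Pool(2) as p:
        # error_callback fires once, in the parent, with the first exception
        # retrieved from the result queue - not necessarily ValueError(1)
        res = p.starmap_async(_process, args_list,
                              error_callback=lambda e: print('first seen:', repr(e)))
        res.wait()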
