Python process not cleaned up for reuse

Question

Hi there,

I stumbled upon a problem with ProcessPoolExecutor where processes can access data they should not be able to. Let me explain:

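I have a situation similar to the example below: I have several runs, each started with different arguments. They compute their stuff in parallel and have no reason to interact with each other. Now, as I understand it, when a process forks, it duplicates itself. The child process has the same (memory) data as its parent, but if it changes anything, it does so on its own copy. If I wanted the changes to survive the lifetime of the child process, I would use queues, pipes and other IPC stuff.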

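But I actually don't! Each process handles data for itself, and that data should not carry over to any other run. The example below shows otherwise, though. Later runs (not the ones running in parallel) can access the data of the previous run in the same process, which means the data has not been cleaned out of the process.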
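The copy-on-fork behaviour described above can be checked on its own, separately from the pool. This is a minimal sketch, assuming a POSIX system where the "fork" start method is available; `counter`, `child` and `demo` are names made up for this illustration and do not appear in the example that follows.

```python
import multiprocessing as mp

# Module-level state, standing in for a class attribute such as Static.integer.
counter = {"value": 0}


def child(conn) -> None:
    # The child starts with a copy of the parent's memory and changes only that copy.
    counter["value"] = 99
    conn.send(counter["value"])
    conn.close()


def demo() -> tuple:
    ctx = mp.get_context("fork")  # assumption: POSIX only, not available on Windows
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=child, args=(child_conn,))
    proc.start()
    seen_in_child = parent_conn.recv()
    proc.join()
    # The parent's copy is not affected by the child's assignment.
    return seen_in_child, counter["value"]


if __name__ == "__main__":
    print(demo())
```

The child's value only reaches the parent because it is sent through the pipe explicitly; the parent's own `counter` keeps its original value.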

Code / Example

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import current_process, set_start_method

class Static:
    integer: int = 0

def inprocess(run: int) -> None:
    cp = current_process()
    # Print current state
    print(f"[{run:2d} {cp.pid} {cp.name}] int: {Static.integer}", flush=True)

    # Check value
    if Static.integer != 0:
        raise Exception(f"[{run:2d} {cp.pid} {cp.name}] Variable already set!")

    # Update value
    Static.integer = run + 1

def pooling():
    cp = current_process()
    # Get master's pid
    print(f"[{cp.pid} {cp.name}] Start")
    with ProcessPoolExecutor(max_workers=2) as executor:
        for i, _ in enumerate(executor.map(inprocess, range(4))):
            print(f"run #{i} finished", flush=True)

if __name__ == '__main__':
    set_start_method("fork")    # enforce fork
    pooling()

Output

[1998 MainProcess] Start
[ 0 2020 Process-1] int: 0
[ 2 2020 Process-1] int: 1
[ 1 2021 Process-2] int: 0
[ 3 2021 Process-2] int: 2
run #0 finished
run #1 finished
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib/python3.6/concurrent/futures/process.py", line 153, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/lib/python3.6/concurrent/futures/process.py", line 153, in <listcomp>
    return [fn(*args) for args in chunk]
  File "<stdin>", line 14, in inprocess
Exception: [ 2 2020 Process-1] Variable already set!
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 29, in <module>
  File "<stdin>", line 24, in pooling
  File "/usr/lib/python3.6/concurrent/futures/process.py", line 366, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
Exception: [ 2 2020 Process-1] Variable already set!

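This behaviour can also be reproduced with max_workers=1, since the process is reused. The start method has no influence on the error (although only "fork" seems to use more than one process).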

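So, to summarize: I would like each new run to get a process that has all the previous data, but no new data from any of the other runs. Is that possible? How would I achieve it? Why does the code above not do exactly that?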

I appreciate any help.

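I found that multiprocessing.pool.Pool accepts maxtasksperchild=1, so a worker process is destroyed once its task is finished. But I dislike the multiprocessing interface; ProcessPoolExecutor is more comfortable to use. Besides, the whole idea of a pool is to save process setup time, and that saving is thrown away if the worker process is killed after every run.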
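For comparison, this is roughly what the maxtasksperchild=1 variant looks like. It is a sketch under assumptions, not a recommendation: it uses the "fork" context (POSIX only), returns the observed value instead of raising, and `run_all` is a helper name invented here. Python 3.11 and later also accept a max_tasks_per_child argument on ProcessPoolExecutor, but that option is documented as incompatible with the "fork" start method and is not shown.

```python
import multiprocessing as mp
import os


class Static:
    integer: int = 0


def inprocess(run: int) -> tuple:
    # Report what this worker sees before modifying the class attribute.
    seen = Static.integer
    Static.integer = run + 1
    return run, os.getpid(), seen


def run_all(n_runs: int = 4) -> list:
    ctx = mp.get_context("fork")  # assumption: POSIX only
    # maxtasksperchild=1: each worker exits after one task and is replaced.
    # chunksize=1 keeps map() from batching several runs into a single task.
    with ctx.Pool(processes=2, maxtasksperchild=1) as pool:
        return pool.map(inprocess, range(n_runs), chunksize=1)


if __name__ == "__main__":
    for run, pid, seen in run_all():
        print(f"[{run:2d} {pid}] int: {seen}")
```

Every run starts in a freshly forked worker, so each one sees the value the parent had at fork time. The cost is exactly the one mentioned above: one process start per run.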

Answer 1

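Brand-new processes in Python do not share memory state. However, ProcessPoolExecutor reuses process instances. It is a pool of active processes, after all. I assume this is done to avoid the OS overhead of stopping and starting processes all the time.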

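You see the same behaviour in other distribution technologies such as Celery, where you can leak global state between executions if you are not careful.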
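The reuse can be observed directly by forcing both runs into a single worker. This is a minimal sketch, assuming Python 3.7 or later (for the mp_context argument) and a POSIX system with the "fork" start method; `observe_reuse` is a name made up for this illustration.

```python
import multiprocessing as mp
import os
from concurrent.futures import ProcessPoolExecutor


class Static:
    integer: int = 0


def inprocess(run: int) -> tuple:
    # Record what this worker sees, then leave a mark behind.
    seen = Static.integer
    Static.integer = run + 1
    return os.getpid(), seen


def observe_reuse() -> list:
    ctx = mp.get_context("fork")  # assumption: POSIX only
    # A single worker forces both runs into the same process.
    with ProcessPoolExecutor(max_workers=1, mp_context=ctx) as executor:
        return list(executor.map(inprocess, range(2)))


if __name__ == "__main__":
    for pid, seen in observe_reuse():
        print(f"pid {pid} saw int: {seen}")
```

Both results carry the same pid, and the second run sees the value written by the first. Nothing is leaked between processes; the second task simply runs in a process whose class attribute was already modified.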

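I recommend managing your namespaces better to encapsulate your data. Using your example, you could, for instance, encapsulate your code and data in a class that you instantiate in inprocess(), instead of storing it in a shared namespace such as a static field on a class or directly in the module. That way the object will eventually be cleaned up by the garbage collector: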

from multiprocessing import current_process

class State:
    def __init__(self):
        # Instance attribute: every State() starts from 0
        self.integer: int = 0

    def do_stuff(self):
        self.integer += 42

def use_global_function(state):
    # Module-level helpers receive the state explicitly
    state.integer -= 1664
    state.do_stuff()

def inprocess(run: int) -> None:
    cp = current_process()
    state = State()  # fresh state for each run, even in a reused worker
    print(f"[{run:2d} {cp.pid} {cp.name}] int: {state.integer}", flush=True)
    if state.integer != 0:
        raise Exception(f"[{run:2d} {cp.pid} {cp.name}] Variable already set!")
    state.integer = run + 1
    state.do_stuff()
    use_global_function(state)

Answer 2

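I have been facing a potentially similar problem and came across an interesting post, "High Memory Usage Using Python Multiprocessing", which points to using gc.collect(); in your case, however, it did not work. So I thought about how the Static class is initialized. A few points: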

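1. Unfortunately, I could not reproduce your minimal example; it fails with a ValueError: ValueError: cannot find context for 'fork'
2. Considering 1, I used set_start_method("spawn"). A quick fix could then be to initialize the Static class every time, as follows: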

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import current_process, set_start_method

class Static:
    integer: int = 0
    def __init__(self):
        pass

def inprocess(run: int) -> None:
    cp = current_process()
    # Print current state
    print(f"[{run:2d} {cp.pid} {cp.name}] int: {Static().integer}", flush=True)

    # Check value
    if Static().integer != 0:
        raise Exception(f"[{run:2d} {cp.pid} {cp.name}] Variable already set!")

    # Update value (this sets an attribute on a temporary instance;
    # the class attribute Static.integer itself stays 0)
    Static().integer = run + 1

def pooling():
    cp = current_process()
    # Get master's pid
    print(f"[{cp.pid} {cp.name}] Start")
    with ProcessPoolExecutor(max_workers=2) as executor:
        for i, _ in enumerate(executor.map(inprocess, range(4))):
            print(f"run #{i} finished", flush=True)

if __name__ == "__main__":
    print("start")
    # set_start_method("fork")  # enforce fork , ValueError: cannot find context for 'fork'
    set_start_method("spawn")    # Alternative
    pooling()

This returns:

[ 0 1424 SpawnProcess-2] int: 0
[ 1 1424 SpawnProcess-2] int: 0
run #0 finished
[ 2 17956 SpawnProcess-1] int: 0
[ 3 1424 SpawnProcess-2] int: 0
run #1 finished
run #2 finished
run #3 finished
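One thing to keep in mind when reading this output: pid 1424 handles three of the four runs and still prints 0 every time. That is consistent with how attribute assignment works in Python. Writing to Static().integer creates an attribute on a temporary instance, while the class attribute that the next task would read is left alone. A minimal sketch of the difference, with variable names chosen only for this illustration:

```python
class Static:
    integer: int = 0


# Assigning through a temporary instance creates an instance attribute
# that is discarded together with the instance.
Static().integer = 5
after_instance_write = Static.integer

# Assigning through the class changes the value that every later task
# in the same worker process will read.
Static.integer = 5
after_class_write = Static.integer

print(after_instance_write, after_class_write)
```

So the version above avoids the exception because nothing is ever stored at class level, not because the worker process is cleaned between runs.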

2 answers

Answer 1
3 2018-06-21 14:10:27

Answer 2
0 2021-02-21 17:38:27
