Multiprocessing Pool hangs if child process killed

I started a pool of worker processes and submitted a bunch of tasks. The system ran low on memory, and the oomkiller killed one of the worker processes. The parent process just hung there, waiting for the tasks to finish, and never returned.

Here's a runnable example that reproduces the problem. Instead of waiting for the oomkiller to kill a worker, I find the process ids of all the worker processes and tell the first task to kill one of them. (The call to ps won't work on all operating systems.)

import os
import signal
from multiprocessing import Pool
from random import choice
from subprocess import run, PIPE
from time import sleep


def run_task(task):
    target_process_id, n = task
    print(f'Processing item {n} in process {os.getpid()}.')
    delay = n + 1
    sleep(delay)
    if n == 0:
        print(f'Item {n} killing process {target_process_id}.')
        os.kill(target_process_id, signal.SIGKILL)
    else:
        print(f'Item {n} finished.')
    return n, delay


def main():
    print('Starting.')
    pool = Pool()

    ps_output = run(['ps', '-opid', '--no-headers', '--ppid', str(os.getpid())],
                    stdout=PIPE, encoding='utf8')
    child_process_ids = [int(line) for line in ps_output.stdout.splitlines()]
    target_process_id = choice(child_process_ids[1:-1])

    tasks = ((target_process_id, i) for i in range(10))
    for n, delay in pool.imap_unordered(run_task, tasks):
        print(f'Received {delay} from item {n}.')

    print('Closing.')
    pool.close()
    pool.join()
    print('Done.')


if __name__ == '__main__':
    main()
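As an aside, the ps call above can be replaced with something portable. This helper is my own sketch, not part of the question: multiprocessing.active_children() returns the live child Process objects, including pool workers, so their pids can be read directly.

```python
from multiprocessing import Pool, active_children


def find_worker_pids():
    # active_children() lists this process's live children, including
    # Pool workers, so there is no need to shell out to ps.
    return sorted(child.pid for child in active_children())


if __name__ == '__main__':
    with Pool(2) as pool:
        print(f'Worker pids: {find_worker_pids()}')
```

Unlike ps, this works on Windows as well, and it can't accidentally pick up unrelated child processes that happen to share the parent pid.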

When I run this on a system with eight CPUs, I see this output:

Starting.
Processing item 0 in process 303.
Processing item 1 in process 304.
Processing item 2 in process 305.
Processing item 3 in process 306.
Processing item 4 in process 307.
Processing item 5 in process 308.
Processing item 6 in process 309.
Processing item 7 in process 310.
Item 0 killing process 308.
Processing item 8 in process 303.
Received 1 from item 0.
Processing item 9 in process 315.
Item 1 finished.
Received 2 from item 1.
Item 2 finished.
Received 3 from item 2.
Item 3 finished.
Received 4 from item 3.
Item 4 finished.
Received 5 from item 4.
Item 6 finished.
Received 7 from item 6.
Item 7 finished.
Received 8 from item 7.
Item 8 finished.
Received 9 from item 8.
Item 9 finished.
Received 10 from item 9.

You can see that item 5 never returns a result, and the pool just waits forever.

How can I make the parent process notice when one of its child processes is killed?

This problem is described in Python bug 9205, but the developers decided to fix it in the concurrent.futures module instead of the multiprocessing module. To take advantage of that fix, switch to the newer process pool.

import os
import signal
from concurrent.futures.process import ProcessPoolExecutor
from random import choice
from subprocess import run, PIPE
from time import sleep


def run_task(task):
    target_process_id, n = task
    print(f'Processing item {n} in process {os.getpid()}.')
    delay = n + 1
    sleep(delay)
    if n == 0:
        print(f'Item {n} killing process {target_process_id}.')
        os.kill(target_process_id, signal.SIGKILL)
    else:
        print(f'Item {n} finished.')
    return n, delay


def main():
    print('Starting.')
    pool = ProcessPoolExecutor()

    pool.submit(os.getpid)  # Force the pool to launch its child processes.
    ps_output = run(['ps', '-opid', '--no-headers', '--ppid', str(os.getpid())],
                    stdout=PIPE, encoding='utf8')
    child_process_ids = [int(line) for line in ps_output.stdout.splitlines()]
    target_process_id = choice(child_process_ids[1:-1])

    tasks = ((target_process_id, i) for i in range(10))
    for n, delay in pool.map(run_task, tasks):
        print(f'Received {delay} from item {n}.')

    print('Closing.')
    pool.shutdown()
    print('Done.')


if __name__ == '__main__':
    main()

Now when you run it, you get a clear error message:

Starting.
Processing item 0 in process 549.
Processing item 1 in process 550.
Processing item 2 in process 552.
Processing item 3 in process 551.
Processing item 4 in process 553.
Processing item 5 in process 554.
Processing item 6 in process 555.
Processing item 7 in process 556.
Item 0 killing process 556.
Processing item 8 in process 549.
Received 1 from item 0.
Traceback (most recent call last):
  File "/home/don/.config/JetBrains/PyCharm2020.1/scratches/scratch2.py", line 42, in <module>
    main()
  File "/home/don/.config/JetBrains/PyCharm2020.1/scratches/scratch2.py", line 33, in main
    for n, delay in pool.map(run_task, tasks):
  File "/usr/lib/python3.7/concurrent/futures/process.py", line 483, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
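Since BrokenProcessPool is an ordinary exception, the parent can also catch it and rebuild the pool instead of crashing. A minimal sketch (the retry helper and attempt count are my own, assuming the tasks are idempotent and safe to resubmit):

```python
from concurrent.futures.process import BrokenProcessPool, ProcessPoolExecutor


def run_with_retry(func, items, attempts=3):
    # A broken executor cannot be reused, so build a fresh one per attempt.
    for attempt in range(1, attempts + 1):
        try:
            with ProcessPoolExecutor() as pool:
                return list(pool.map(func, items))
        except BrokenProcessPool:
            print(f'Pool broke on attempt {attempt}; retrying.')
    raise RuntimeError(f'Tasks still failing after {attempts} attempts.')
```

Note that pool.map restarts from scratch on each attempt, so this only helps when the tasks are idempotent; for finer-grained recovery, submit futures individually and resubmit only the ones whose result() raises BrokenProcessPool.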

I ran into the same problem, and while I was dealing with it, concurrent.futures wasn't much better. I ended up using the Ray module. Here is my sample code, which retries the killed tasks with a decreasing number of workers, so in the worst case the most memory-hungry task gets a chance to finish on a single worker. Run it with care, because the OOM killer may kill other processes as well:

import ray
import logging
from multiprocessing import cpu_count
import numpy as np
import psutil

# the default max_retries is 3, but in this case there is no point to retry with the same amount of workers
@ray.remote(max_retries=0)
def f(x):
    logging.warning("worker started %s", x)
    allocate = int(psutil.virtual_memory().total / (cpu_count() - 3) / 8)
    logging.warning("worker allocate %s element float array for %s", allocate, x)
    crash = np.ones([allocate])
    # make sure the interpreter won't optimize out the above allocation
    logging.warning("worker print %s for %s", crash[0], x)
    logging.warning("worker finished %s", x)
    return x

def main():
    processes = cpu_count() - 1
    alljobs = range(processes + 1)
    completedjobs = []

    try:
        while alljobs:
            logging.warning("Number of jobs: %s", len(alljobs))
            logging.warning("Number of workers: %s", processes)
            ray.init(num_cpus=processes)
            result_ids = [f.remote(i) for i in alljobs]
            while True:
                try:
                    while len(result_ids):
                        done_id, result_ids = ray.wait(result_ids, num_returns=1)
                        x = ray.get(done_id[0])
                        logging.warning("results from %s", x)
                        completedjobs.append(x)
                except ray.exceptions.WorkerCrashedError:
                    logging.warning("Continue after WorkerCrashedError")
                    continue
                break
            # rerun the killed jobs on fewer workers to relieve memory pressure
            alljobs = list(set(alljobs) - set(completedjobs))
            ray.shutdown()
            if processes > 1:
                processes -= 1
            else:
                break
    except Exception as ex:
        template = "An exception of type {0} occurred. Arguments:\n{1!r}"
        message = template.format(type(ex).__name__, ex.args)
        logging.exception(message)
        raise

if __name__ == "__main__":
    main()
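If you have to stay on multiprocessing.Pool, the only real defence I know of is a timeout on each result, since the pool itself never notices the dead worker. A rough sketch (the helper and the timeout value are my own assumptions):

```python
from multiprocessing import Pool, TimeoutError


def collect_with_timeout(func, items, timeout=60):
    with Pool() as pool:
        async_results = [pool.apply_async(func, (item,)) for item in items]
        results = []
        for item, async_result in zip(items, async_results):
            try:
                # get(timeout=...) raises TimeoutError instead of blocking
                # forever, so a killed worker surfaces as a timeout.
                results.append(async_result.get(timeout=timeout))
            except TimeoutError:
                print(f'Item {item} timed out; its worker may have been killed.')
        return results
```

The downside is that a slow-but-alive task and a dead worker look the same, so the timeout has to be comfortably longer than the slowest legitimate task.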
