
How to restart a process in a pool when it hits an exception in Python

import signal
import asyncio
import os
import random
import time
import multiprocessing

my_list = []
for i in range(0,10):
    n = random.randint(1,100)
    my_list.append(n)


async def loop_item(my_item):
    while True:
        a = random.randint(1, 2)
        if a == 2:
            print(f"process id: {os.getpid()}")
            raise Exception('Error')
        print(f"process id: {os.getpid()} - {my_item}")
        time.sleep(0.5)


def run_loop(my_item):
    asyncio.run(loop_item(my_item))


def throw_error(e):
    os.system('bash /root/my-script.sh')  # launches "python my-script.py"
    os.killpg(os.getpgid(os.getpid()), signal.SIGKILL)


if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=10)
    for my_item in my_list:
        pool.apply_async(run_loop, (my_item,), error_callback=throw_error)
    pool.close()
    pool.join()

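This is my demo code. It creates my_list, which holds 10 random numbers as items,

then starts 10 processes that each print their item along with their pid.

Then I added a raise Exception to simulate any kind of exception that might happen. If an exception occurs, I want to restart this loop_item(my_item) function in a new process.

There are two obstacles here. One is passing the variable my_item, but I think I should be able to make that work with an external tool for putting/getting variables, such as Redis. Any better ideas are appreciated.

What really blocks me is how to start that process again efficiently after it hits an exception and exits.

So far I have been able to use the throw_error function to kill the Python script itself, or to launch another shell script that kills and relaunches the Python script, but this approach seems inefficient.

So I wonder: is there a better way to restart just the failed process instead of restarting the whole script?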

One approach I tried is creating a new process pool inside the throw_error function, like:

def throw_error(e):
    pool2 = multiprocessing.Pool(processes=1)
    pool2.apply_async(run_loop, (my_item,), error_callback=throw_error)
    pool2.close()
    pool2.join()

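But this seems to be a bad idea: after several exceptions the process pools get out of control, and hundreds or even thousands of "zombie" processes accumulate.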
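A different route to restarting only the failed process, instead of the whole script or a new pool per failure, is to drop the Pool and supervise plain multiprocessing.Process objects. The block below is a minimal sketch under stated assumptions: it assumes a POSIX system (it requests the "fork" start method, in line with the question's use of bash and os.killpg), and flaky_worker, supervise and max_restarts are hypothetical names. The parent polls each child; when one exits with a non-zero exit code, which is what an uncaught exception in a child produces, it starts a replacement with the same item. The item is an ordinary Process argument, so no Redis is needed to pass it, and since the parent join()s every child, finished processes are reaped instead of piling up as zombies.

```python
import multiprocessing
import os
import random


def flaky_worker(item, fail_rate):
    """Hypothetical stand-in for run_loop: prints its item, or crashes at random."""
    if random.random() < fail_rate:
        raise RuntimeError(f"worker for {item} crashed")
    print(f"process id: {os.getpid()} - {item}")


def supervise(items, fail_rate=0.5, max_restarts=5):
    """Run one process per item and restart only the ones that exit with an error.

    Returns a list with the number of restarts each item needed.
    """
    # Assumption: a POSIX system, since "fork" is not available on Windows.
    ctx = multiprocessing.get_context("fork")
    restarts = [0] * len(items)
    running = {}  # index into items -> live Process

    def start(i):
        # The item is passed as a plain argument, no external store needed.
        proc = ctx.Process(target=flaky_worker, args=(items[i], fail_rate))
        proc.start()
        running[i] = proc

    for i in range(len(items)):
        start(i)

    while running:
        for i, proc in list(running.items()):
            proc.join(timeout=0.1)  # join() also reaps a finished child
            if proc.is_alive():
                continue
            del running[i]
            # An uncaught exception in the child gives exit code 1.
            if proc.exitcode != 0 and restarts[i] < max_restarts:
                restarts[i] += 1
                start(i)  # restart just this item, nothing else
    return restarts


if __name__ == "__main__":
    my_list = [random.randint(1, 100) for _ in range(10)]
    print(supervise(my_list))
```

Items are tracked by index rather than by value because random.randint can produce duplicates in my_list.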

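I assume this is one of those XY Problems.

This answer suggests an alternative design that solves problem X, rather than solving problem Y, i.e. restarting a process in the pool.


As far as I know, Python doesn't offer good control over gracefully terminating child processes or threads once they have been spawned.

So the best approach is simply not to let each process/thread fail outright and propagate the error in the first place.

This can be done by writing a small wrapper that puts the function call in a try-except block catching Exception, so any exception the function raises is caught. A while loop then retries the call.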

def wrapper(func, max_retries, data):
    """
    Wrapper adding retry capabilities to workload.

    Args:
        func: function to execute.
        max_retries: Maximum retries until moving on.
        data: data to process.

    Returns:
        (Input data, result) Tuple.
    """

    # loop so that a failed attempt can be retried
    retry_count = 0
    while retry_count <= max_retries:  # use `while True:` to retry forever
        retry_count += 1

        # try to process data
        try:
            result = func(data)
        except Exception as err:
            # on exception save result as error and retry
            result = err
        else:
            # otherwise it was successful, break out of retry loop
            break

    # return result
    return data, result
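To make the retry count concrete, here is a small single-process check of the wrapper above. Flaky is a hypothetical test double, not part of the original answer, and the wrapper is repeated (comments trimmed) so the block runs on its own. With max_retries=3 the wrapper makes up to 4 attempts, and when all of them fail the exception object itself is returned as the result instead of being raised.

```python
def wrapper(func, max_retries, data):
    """Copy of the retry wrapper above, comments trimmed."""
    retry_count = 0
    while retry_count <= max_retries:
        retry_count += 1
        try:
            result = func(data)
        except Exception as err:
            result = err
        else:
            break
    return data, result


class Flaky:
    """Hypothetical test double: fails the first `failures` calls, then doubles its input."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        if self.calls <= self.failures:
            raise ValueError(f"failure #{self.calls}")
        return 2 * x


flaky = Flaky(failures=2)
print(wrapper(flaky, 3, 21), flaky.calls)  # (21, 42) 3

always_failing = Flaky(failures=10)
data, result = wrapper(always_failing, 3, 21)
print(type(result).__name__, always_failing.calls)  # ValueError 4
```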

Here is some silly demo code to test the idea; the workload fails half of the time.

import logging
import functools
import random
from os import getpid
from multiprocessing import Pool


logging.basicConfig(format="%(levelname)-8s %(message)s", level=logging.DEBUG)
logger = logging.getLogger()


def wrapper(func, max_retries, data):
    """
    Wrapper adding retry capabilities to workload with some fancy output.

    Args:
        func: function to execute.
        max_retries: Maximum retries until moving on.
        data: data to process.

    Returns:
        (Input data, result) Tuple.
    """
    pid = f"{getpid():<6}"
    logger.info(f"[{pid}]  Processing {data}")

    # just a line to satisfy pylint
    result = None

    # loop so that a failed attempt can be retried
    retry_count = 0

    while retry_count <= max_retries:
        # try to process data
        try:
            result = func(data)
        except Exception as err:
            # on exception print out error, set result as err, then retry
            logger.error(
                f"[{pid}]  {err.__class__.__name__} while processing {data}, "
                f"{max_retries - retry_count} retries left. "
            )
            result = err
        else:
            break

        retry_count += 1

    # print and return result
    logger.info(f"[{pid}]  Processing {data} done")
    return data, result


class RogueAIException(Exception):
    pass


def workload(n):
    """
    Quite rebellious Fibonacci function
    """

    if random.randint(0, 1):
        raise RogueAIException("I'm sorry Dave, I'm Afraid I can't do that.")

    # iterate n times so that workload(0) == 0 and workload(1) == 1
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b

    return a


def main():
    data = [random.randint(0, 100) for _ in range(20)]

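    # fix func and max_retries with functools.partial: the pool must pickle what it
    # sends to the workers, and a partial of module-level functions pickles fine
    # (a closure made by a decorator may not).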
    wrapped_workload = functools.partial(wrapper, workload, 3)

    with Pool(processes=3) as pool:
        # apply function for each data
        results = pool.map(wrapped_workload, data)

        print("\nInput Output")
        for fed_data, result in results:
            print(f"{fed_data:<6}{result}")


if __name__ == '__main__':
    main()
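Applied to the question's own code, the same retry idea means catching the exception inside run_loop and calling loop_item again in the same worker, so no process ever has to be restarted. This is a minimal sketch, not the answer's code: the steps parameter is hypothetical and only makes loop_item finite so the demo ends, and max_restarts is a hypothetical limit where None means retry forever. It also swaps the question's time.sleep for await asyncio.sleep, because time.sleep inside a coroutine blocks the event loop.

```python
import asyncio
import os
import random
from multiprocessing import Pool


async def loop_item(my_item, steps):
    """Like the question's loop_item, but finite (`steps` iterations) so the demo ends."""
    for _ in range(steps):
        if random.randint(1, 2) == 2:
            raise Exception("Error")
        print(f"process id: {os.getpid()} - {my_item}")
        await asyncio.sleep(0.01)  # non-blocking, unlike time.sleep in a coroutine
    return my_item


def run_loop(my_item, steps=3, max_restarts=None):
    """Re-run loop_item in the same worker process whenever it raises.

    max_restarts=None retries forever. Returns (result or last exception, restarts).
    """
    restarts = 0
    while True:
        try:
            # asyncio.run creates a fresh event loop for every attempt
            return asyncio.run(loop_item(my_item, steps)), restarts
        except Exception as err:
            if max_restarts is not None and restarts >= max_restarts:
                return err, restarts
            restarts += 1


if __name__ == "__main__":
    my_list = [random.randint(1, 100) for _ in range(10)]
    with Pool(processes=10) as pool:
        for result, restarts in pool.map(run_loop, my_list):
            print(f"{result!r:<8} restarted {restarts} times")
```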

Output:

INFO     [13904 ]  Processing 40
ERROR    [13904 ]  RogueAIException while processing 40, 3 retries left. 
INFO     [13904 ]  Processing 40 done
INFO     [13904 ]  Processing 93
ERROR    [13904 ]  RogueAIException while processing 93, 3 retries left. 
INFO     [13904 ]  Processing 93 done
INFO     [13904 ]  Processing 96
INFO     [13904 ]  Processing 96 done
INFO     [13904 ]  Processing 48
INFO     [13904 ]  Processing 48 done
INFO     [13904 ]  Processing 17
INFO     [13904 ]  Processing 17 done
INFO     [13904 ]  Processing 52
ERROR    [13904 ]  RogueAIException while processing 52, 3 retries left. 
INFO     [13904 ]  Processing 52 done
INFO     [13904 ]  Processing 96
ERROR    [13904 ]  RogueAIException while processing 96, 3 retries left. 
INFO     [13904 ]  Processing 96 done
INFO     [13904 ]  Processing 23
ERROR    [13904 ]  RogueAIException while processing 23, 3 retries left. 
INFO     [13904 ]  Processing 23 done
INFO     [13904 ]  Processing 99
ERROR    [13904 ]  RogueAIException while processing 99, 3 retries left. 
ERROR    [13904 ]  RogueAIException while processing 99, 2 retries left.
INFO     [13904 ]  Processing 99 done
INFO     [13904 ]  Processing 55
ERROR    [13904 ]  RogueAIException while processing 55, 3 retries left.
ERROR    [13904 ]  RogueAIException while processing 55, 2 retries left.
INFO     [13904 ]  Processing 55 done
INFO     [13904 ]  Processing 63
ERROR    [13904 ]  RogueAIException while processing 63, 3 retries left.
INFO     [13904 ]  Processing 63 done
INFO     [13904 ]  Processing 61
INFO     [25180 ]  Processing 3
ERROR    [13904 ]  RogueAIException while processing 61, 3 retries left.
INFO     [25180 ]  Processing 3 done
INFO     [13904 ]  Processing 61 done
INFO     [25180 ]  Processing 42
INFO     [13904 ]  Processing 33
ERROR    [25180 ]  RogueAIException while processing 42, 3 retries left.
ERROR    [13904 ]  RogueAIException while processing 33, 3 retries left.
ERROR    [25180 ]  RogueAIException while processing 42, 2 retries left.
INFO     [13904 ]  Processing 33 done
ERROR    [25180 ]  RogueAIException while processing 42, 1 retries left.
INFO     [13904 ]  Processing 2
INFO     [25180 ]  Processing 42 done
INFO     [13904 ]  Processing 2 done
INFO     [25180 ]  Processing 35
INFO     [13904 ]  Processing 45
INFO     [25180 ]  Processing 35 done
INFO     [13904 ]  Processing 45 done
INFO     [25180 ]  Processing 2
INFO     [13904 ]  Processing 11
ERROR    [25180 ]  RogueAIException while processing 2, 3 retries left.
ERROR    [13904 ]  RogueAIException while processing 11, 3 retries left.
INFO     [25180 ]  Processing 2 done
ERROR    [13904 ]  RogueAIException while processing 11, 2 retries left.
ERROR    [13904 ]  RogueAIException while processing 11, 1 retries left.
ERROR    [13904 ]  RogueAIException while processing 11, 0 retries left.
INFO     [13904 ]  Processing 11 done

Input Output
40    102334155
93    12200160415121876738
96    51680708854858323072
48    4807526976
17    1597
52    32951280099
96    51680708854858323072
23    28657
99    218922995834555169026
55    139583862445
63    6557470319842
61    2504730781961
3     2
42    267914296
33    3524578
2     1
35    9227465
2     1
45    1134903170
11    I'm sorry Dave, I'm Afraid I can't do that.
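In the output above, input 11 used up all its retries, so the exception object itself came back as its result. Because the wrapper returns exceptions rather than raising them, a caller can separate successes from failures with isinstance, for example to resubmit the failed inputs in another pass. The helper below is a hypothetical sketch of that check, and the sample data mimics the shape of the results list using RuntimeError in place of RogueAIException.

```python
def split_results(results):
    """Split (data, result) pairs from the retry wrapper into successes and failures.

    Hypothetical helper: a result that is an exception means every retry failed.
    """
    succeeded = [(data, result) for data, result in results
                 if not isinstance(result, BaseException)]
    failed = [data for data, result in results
              if isinstance(result, BaseException)]
    return succeeded, failed


sample = [(3, 2), (11, RuntimeError("I'm sorry Dave")), (2, 1)]
ok, failed = split_results(sample)
print(ok)      # [(3, 2), (2, 1)]
print(failed)  # [11]
```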


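Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.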
