
Python Multiprocessing Skip Child Segfault

I am trying to use multiprocessing with a function that can hit a segmentation fault (which I cannot control at the moment). When a child process segfaults, I want only that child to fail, while all the other child tasks continue and return their results.

I have already switched from multiprocessing.Pool to concurrent.futures.ProcessPoolExecutor to avoid the problem of child processes hanging forever (or until an arbitrary timeout), as described in this bug: https://bugs.python.org/issue22393
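For reference, a minimal reproduction of that hang with multiprocessing.Pool might look like this (just a sketch, assuming the Python 3.7 behaviour described in the bug report; not code from my actual application):

import ctypes
import multiprocessing


def crash(x):
    if x == 2:
        ctypes.string_at(0)  # dereference a NULL pointer -> segfault kills the worker
    return x


if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        # On Python 3.7 this call never returns: the pool keeps waiting for a
        # result from the worker that was killed, as described in bpo-22393.
        print(pool.map(crash, [1, 2, 3]))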

However, the problem I now face is that when the first child task hits a segfault, all in-flight child processes get marked as broken (concurrent.futures.process.BrokenProcessPool).

Is there a way to mark only the child process that actually broke as broken?

The code I am running in Python 3.7.4:

import concurrent.futures
import ctypes
from time import sleep


def do_something(x):
    print(f"{x}; in do_something")
    sleep(x*3)
    if x == 2:
        # raise a segmentation fault internally
        return x, ctypes.string_at(0)
    return x, x-1


nums = [1, 2, 3, 1.5]
executor = concurrent.futures.ProcessPoolExecutor()
result_futures = []
for num in nums:
    # Using submit with a list instead of map lets you get past the first exception
    # Example: https://stackoverflow.com/a/53346191/7619676
    future = executor.submit(do_something, num)
    result_futures.append(future)

# Wait for all results
concurrent.futures.wait(result_futures)

# After a segfault is hit for any child process (i.e. is "terminated abruptly"), the process pool becomes unusable
# and all running/pending child processes' results are set to broken
for future in result_futures:
    try:
        print(future.result())
    except concurrent.futures.process.BrokenProcessPool:
        print("broken")

Result:

(1, 0)
broken
broken
(1.5, 0.5)

Desired result:

(1, 0)
broken
(3, 2)
(1.5, 0.5)

Based on @Richard Sheridan's answer, I ended up using the code below. This version does not require setting a timeout, which is something I could not do for my use case.

import ctypes
import multiprocessing
from typing import Dict
from time import sleep


def do_something(x, result):
    print(f"{x} starting")
    sleep(x * 3)
    if x == 2:
        # raise a segmentation fault internally; the process dies on this line
        y = ctypes.string_at(0)
    y = x
    print(f"{x} done")
    result.put(y)

def wait_for_process_slot(
    processes: Dict,
    concurrency: int = multiprocessing.cpu_count() - 1,
    wait_sec: int = 1,
) -> int:
    """Blocks main process if `concurrency` processes are already running.

    Alternative to `multiprocessing.Semaphore.acquire`
    useful for when child processes might fail and not be able to signal.
    Relies instead on the main's (parent's) tracking of `multiprocessing.Process`es.

    """
    counter = 0
    while True:
        counter = sum([1 for i, p in processes.items() if p.is_alive()])
        if counter < concurrency:
            return counter
        sleep(wait_sec)


if __name__ == "__main__":
    # "spawn" results in an OSError b/c pickling a segfault fails?
    ctx = multiprocessing.get_context()
    manager = ctx.Manager()
    results_queue = manager.Queue(maxsize=-1)

    concurrency = multiprocessing.cpu_count() - 1  # reserve 1 CPU for waiting
    nums = [3, 1, 2, 1.5]
    all_processes = {}
    for idx, num in enumerate(nums):
        num_running_processes = wait_for_process_slot(all_processes, concurrency)

        p = ctx.Process(target=do_something, args=(num, results_queue), daemon=True)
        all_processes.update({idx: p})
        p.start()

    # Wait for the last batch of processes not blocked by wait_for_process_slot to finish
    for p in all_processes.values():
        p.join()

    # Check last batch of processes for bad processes
    # Relies on all processes having finished (the p.joins above)
    bad_nums = [idx for idx, p in all_processes.items() if p.exitcode != 0]
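
Once all the processes have been joined, the successful results can be drained from results_queue. A minimal sketch of that last step (the queue import and the good_results name are just illustrative, continuing the code above):

import queue

good_results = []
while True:
    try:
        good_results.append(results_queue.get_nowait())
    except queue.Empty:
        break

print(f"inputs whose process died (by index): {bad_nums}")
print(f"collected results: {good_results}")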

multiprocessing.Pool and concurrent.futures.ProcessPoolExecutor both make assumptions about how to handle the concurrency of the interactions between the workers and the main process that are violated if any one process is killed or segfaults, so they do the safe thing and mark the whole pool as broken. To get around this, you will need to build up your own pool, with different assumptions, directly using multiprocessing.Process instances.

This might sound intimidating, but a list and a multiprocessing.Manager will get you pretty far:

import multiprocessing
import ctypes
import queue
from time import sleep

def do_something(job, result):
    while True:
        x=job.get()
        print(f"{x}; in do_something")
        sleep(x*3)
        if x == 2:
            # raise a segmentation fault internally
            return x, ctypes.string_at(0)
        result.put((x, x-1))

nums = [1, 2, 3, 1.5]

if __name__ == "__main__":
    # you ARE using the spawn context, right?
    ctx = multiprocessing.get_context("spawn")
    manager = ctx.Manager()
    job_queue = manager.Queue(maxsize=-1)
    result_queue = manager.Queue(maxsize=-1)
    pool = [
        ctx.Process(target=do_something, args=(job_queue, result_queue), daemon=True)
        for _ in range(multiprocessing.cpu_count())
    ]
    for proc in pool:
        proc.start()
    for num in nums:
        job_queue.put(num)
    try:
        while True:
            # Timeout is our only signal that no more results coming
            print(result_queue.get(timeout=10))
    except queue.Empty:
        print("Done!")
    print(pool)  # will see one dead Process 

This "pool" is a little inflexible, and you will probably want to customize it for your application's specific needs, but you can definitely skip right over segfaulting workers.
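
For example, one possible customization (just a sketch, reusing do_something, ctx, job_queue, and result_queue from the block above) is to replace any worker that has died so the pool keeps its full size:

def respawn_dead_workers(pool, ctx, job_queue, result_queue):
    """Return a new worker list with any dead workers replaced."""
    fresh = []
    for proc in pool:
        if proc.is_alive():
            fresh.append(proc)
            continue
        # A negative exitcode means the worker was killed by a signal,
        # e.g. -11 for SIGSEGV on Linux.
        print(f"worker {proc.pid} exited with {proc.exitcode}; starting a replacement")
        replacement = ctx.Process(
            target=do_something, args=(job_queue, result_queue), daemon=True
        )
        replacement.start()
        fresh.append(replacement)
    return fresh


# e.g. call it periodically inside the result-collection loop:
# pool = respawn_dead_workers(pool, ctx, job_queue, result_queue)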

When I went down this rabbit hole, I was interested in cancelling specific submissions to a worker pool, and I eventually wrote a whole library to integrate into Trio async apps: trio-parallel. Hopefully you won't need to go that far!
