Python Multiprocessing Skip Child Segfault
I am trying to use multiprocessing with a function that can segfault (I have no control over that ATM). In cases where a child process hits the segfault, I want only that child task to fail, but all other child tasks to continue and return their results.

I have already switched from `multiprocessing.Pool` to `concurrent.futures.ProcessPoolExecutor` to avoid the issue of the child process hanging forever (or until an arbitrary timeout), as documented in this bug: https://bugs.python.org/issue22393.
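For reference, here is a minimal sketch (mine, not from the original post) of the `multiprocessing.Pool` behavior that bug describes: when a worker segfaults, the `AsyncResult` for its task never completes, so an arbitrary timeout is the only way out:

```python
import ctypes
import multiprocessing

def crash(x):
    # dereference a NULL pointer, which raises a real SIGSEGV in the worker
    return ctypes.string_at(0)

if __name__ == "__main__":
    with multiprocessing.Pool(processes=1) as pool:
        res = pool.apply_async(crash, (1,))
        try:
            # without the timeout this would block indefinitely
            res.get(timeout=5)
        except multiprocessing.TimeoutError:
            print("worker died; its result will never arrive")
```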
However, the problem I now face is that when the first child task hits a segfault, all in-flight child processes get marked as broken (`concurrent.futures.process.BrokenProcessPool`).

Is there a way to mark only the actually broken child process as broken?

The code I am running, in Python 3.7.4:
```python
import concurrent.futures
import ctypes
from time import sleep

def do_something(x):
    print(f"{x}; in do_something")
    sleep(x * 3)
    if x == 2:
        # raise a segmentation fault internally
        return x, ctypes.string_at(0)
    return x, x - 1

nums = [1, 2, 3, 1.5]
executor = concurrent.futures.ProcessPoolExecutor()
result_futures = []
for num in nums:
    # Using submit with a list instead of map lets you get past the first exception
    # Example: https://stackoverflow.com/a/53346191/7619676
    future = executor.submit(do_something, num)
    result_futures.append(future)

# Wait for all results
concurrent.futures.wait(result_futures)

# After a segfault is hit for any child process (i.e. it is "terminated abruptly"),
# the process pool becomes unusable and all running/pending child processes'
# results are set to broken
for future in result_futures:
    try:
        print(future.result())
    except concurrent.futures.process.BrokenProcessPool:
        print("broken")
```
Result:
(1, 0)
broken
broken
(1.5, 0.5)
Desired result:
(1, 0)
broken
(3, 2)
(1.5, 0.5)
Based on @Richard Sheridan's answer, I ended up using the code below. This version does not require setting a timeout, which is something I could not do for my use case.
```python
import ctypes
import multiprocessing
from time import sleep
from typing import Dict

def do_something(x, result):
    print(f"{x} starting")
    sleep(x * 3)
    if x == 2:
        # raise a segmentation fault internally
        y = ctypes.string_at(0)
    y = x
    print(f"{x} done")
    result.put(y)  # put into the queue passed in as an argument

def wait_for_process_slot(
    processes: Dict,
    concurrency: int = multiprocessing.cpu_count() - 1,
    wait_sec: int = 1,
) -> int:
    """Block the main process while `concurrency` processes are already running.

    Alternative to `multiprocessing.Semaphore.acquire`, useful for when child
    processes might fail and not be able to signal. Relies instead on the main
    (parent) process's own tracking of its `multiprocessing.Process`es.
    """
    while True:
        counter = sum(1 for p in processes.values() if p.is_alive())
        if counter < concurrency:
            return counter
        sleep(wait_sec)

if __name__ == "__main__":
    # "spawn" results in an OSError b/c pickling a segfault fails?
    ctx = multiprocessing.get_context()
    manager = ctx.Manager()
    results_queue = manager.Queue(maxsize=-1)

    concurrency = multiprocessing.cpu_count() - 1  # reserve 1 CPU for waiting
    nums = [3, 1, 2, 1.5]
    all_processes = {}
    for idx, num in enumerate(nums):
        num_running_processes = wait_for_process_slot(all_processes, concurrency)
        p = ctx.Process(target=do_something, args=(num, results_queue), daemon=True)
        all_processes.update({idx: p})
        p.start()

    # Wait for the last batch of processes not blocked by wait_for_process_slot
    for p in all_processes.values():
        p.join()

    # Check for bad processes; relies on all processes having finished (the joins above)
    bad_nums = [idx for idx, p in all_processes.items() if p.exitcode != 0]
```
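The `p.exitcode != 0` check above works because a child killed by a signal reports the negated signal number as its exit code. A small standalone illustration of this (mine, using `os.kill` to simulate the segfault):

```python
import multiprocessing
import os
import signal

def crash():
    # simulate a native segfault by delivering a real SIGSEGV to ourselves
    os.kill(os.getpid(), signal.SIGSEGV)

def succeed():
    pass

if __name__ == "__main__":
    bad = multiprocessing.Process(target=crash)
    good = multiprocessing.Process(target=succeed)
    bad.start(); good.start()
    bad.join(); good.join()
    print(bad.exitcode)   # -signal.SIGSEGV (e.g. -11 on Linux)
    print(good.exitcode)  # 0
```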
Both `multiprocessing.Pool` and `concurrent.futures.ProcessPoolExecutor` make assumptions about how to handle the concurrency of interactions between the workers and the main process that are violated if any one process is killed or segfaults, so they do the safe thing and mark the whole pool as broken. To get around this, you will need to build up your own pool, with different assumptions, directly using `multiprocessing.Process` instances.

This might sound intimidating, but a `list` and a `multiprocessing.Manager` will get you pretty far:
```python
import multiprocessing
import ctypes
import queue
from time import sleep

def do_something(job, result):
    while True:
        x = job.get()
        print(f"{x}; in do_something")
        sleep(x * 3)
        if x == 2:
            # raise a segmentation fault internally
            return x, ctypes.string_at(0)
        result.put((x, x - 1))

nums = [1, 2, 3, 1.5]

if __name__ == "__main__":
    # you ARE using the spawn context, right?
    ctx = multiprocessing.get_context("spawn")
    manager = ctx.Manager()
    job_queue = manager.Queue(maxsize=-1)
    result_queue = manager.Queue(maxsize=-1)
    pool = [
        ctx.Process(target=do_something, args=(job_queue, result_queue), daemon=True)
        for _ in range(multiprocessing.cpu_count())
    ]
    for proc in pool:
        proc.start()
    for num in nums:
        job_queue.put(num)
    try:
        while True:
            # Timeout is our only signal that no more results are coming
            print(result_queue.get(timeout=10))
    except queue.Empty:
        print("Done!")
    print(pool)  # will see one dead Process
```
This "pool" is a little inflexible, and you will probably want to customize it to the specific needs of your application, but you can definitely skip right over segfaulting workers.
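One customization worth considering (my sketch, not from the answer): a watchdog that replaces dead workers in place, so a segfault does not permanently shrink the pool's capacity:

```python
import multiprocessing

def respawn_dead_workers(pool, ctx, target, args):
    """Replace any worker that has terminated (e.g. after a segfault)
    with a fresh Process, in place, and return the pool."""
    for i, proc in enumerate(pool):
        if proc.exitcode is not None:  # exitcode is None while still running
            fresh = ctx.Process(target=target, args=args, daemon=True)
            fresh.start()
            pool[i] = fresh
    return pool
```

You would call this periodically from the main process's result-draining loop, passing the same target function and queues used to build the original pool.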
When I went down this rabbit hole, I was interested in cancelling specific submissions to a worker pool, and I eventually wrote a whole library to integrate into Trio async apps: trio-parallel. Hopefully you won't need to go that far!