How to terminate a process using Python's multiprocessing
I have some code that needs to run against several other systems that may hang or have problems outside of my control. I would like to use Python's multiprocessing to spawn child processes that run independently of the main program, and then terminate them when they hang or have problems, but I am not sure of the best way to go about this.

When terminate is called it does kill the child process, but then it becomes a defunct zombie that is not released until the process object is gone. The example code below, where the loop never ends, works to kill it and allow a respawn when called again, but it does not seem like a good way of going about this (i.e. it would be better to have multiprocessing.Process() in __init__()).

Anyone have a suggestion?
import multiprocessing
import time

class Process(object):
    def __init__(self):
        self.thing = Thing()
        self.running_flag = multiprocessing.Value("i", 1)

    def run(self):
        self.process = multiprocessing.Process(target=self.thing.worker,
                                               args=(self.running_flag,))
        self.process.start()
        print(self.process.pid)

    def pause_resume(self):
        self.running_flag.value = not self.running_flag.value

    def terminate(self):
        self.process.terminate()

class Thing(object):
    def __init__(self):
        self.count = 1

    def worker(self, running_flag):
        while True:
            if running_flag.value:
                self.do_work()

    def do_work(self):
        print("working {0} ...".format(self.count))
        self.count += 1
        time.sleep(1)
You can run the child process as a daemon in the background.

process.daemon = True

Any errors and hangs (or infinite loops) in a daemon process will not affect the main process, and it will only be terminated once the main process exits.

This works for simple problems, until you run into many child daemon processes which keep consuming memory from the parent process without any explicit control.
The best way is to set up a Queue to have all the child processes communicate with the parent process, so that we can join them and clean up nicely. Here is some simple code that checks whether a child process is hanging (a.k.a. time.sleep(1000)) and sends a message to the queue for the main process to take action on it:
import multiprocessing as mp
import time
import queue

running_flag = mp.Value("i", 1)

def worker(running_flag, q):
    count = 1
    while True:
        if running_flag.value:
            print("working {0} ...".format(count))
            count += 1
            q.put(count)
        time.sleep(1)
        if count > 3:
            # Simulate hanging with sleep
            print("hanging...")
            time.sleep(1000)

def watchdog(q):
    """
    Check the queue for updates and send a signal on it
    when the child process hasn't sent anything for too long.
    """
    while True:
        try:
            msg = q.get(timeout=10.0)
        except queue.Empty:
            print("[WATCHDOG]: Maybe WORKER is slacking")
            q.put("KILL WORKER")

def main():
    """The main process"""
    q = mp.Queue()

    workr = mp.Process(target=worker, args=(running_flag, q))
    wdog = mp.Process(target=watchdog, args=(q,))

    # run the watchdog as daemon so it terminates with the main process
    wdog.daemon = True

    workr.start()
    print("[MAIN]: starting process P1")
    wdog.start()

    # Poll the queue
    while True:
        msg = q.get()
        if msg == "KILL WATCHDOG":
            print("[MAIN]: Terminating slacking WORKER")
            workr.terminate()
            time.sleep(0.1)
            if not workr.is_alive():
                print("[MAIN]: WORKER is a goner")
                workr.join(timeout=1.0)
                print("[MAIN]: Joined WORKER successfully!")
                q.close()
                break    # watchdog process daemon gets terminated

if __name__ == '__main__':
    main()
Without terminating the worker, attempting to join() it to the main process would have blocked forever, since the worker never finishes.
The way Python multiprocessing handles processes is a bit confusing.

From the multiprocessing programming guidelines:

Joining zombie processes

On Unix when a process finishes but has not been joined it becomes a zombie. There should never be very many because each time a new process starts (or active_children() is called) all completed processes which have not yet been joined will be joined. Also calling a finished process's Process.is_alive will join the process. Even so it is probably good practice to explicitly join all the processes that you start.

In order to avoid a process becoming a zombie, you need to call its join() method once you kill it.
If you want a simpler way to deal with hanging calls in your system, you can take a look at pebble.

(Don't have enough reputation points to comment, hence a full answer.)
@PieOhPah: thank you for this very nice example. Unfortunately there is one little flaw that keeps the watchdog from killing the worker:

if msg == "KILL WATCHDOG":

It should be:

if msg == "KILL WORKER":

So the code becomes (with the prints also updated for Python 3):
import multiprocessing as mp
import time
import queue

running_flag = mp.Value("i", 1)

def worker(running_flag, q):
    count = 1
    while True:
        if running_flag.value:
            print("working {0} ...".format(count))
            count += 1
            q.put(count)
        time.sleep(1)
        if count > 3:
            # Simulate hanging with sleep
            print("hanging...")
            time.sleep(1000)

def watchdog(q):
    """
    Check the queue for updates and send a signal on it
    when the child process hasn't sent anything for too long.
    """
    while True:
        try:
            msg = q.get(timeout=10.0)
        except queue.Empty:
            print("[WATCHDOG]: Maybe WORKER is slacking")
            q.put("KILL WORKER")

def main():
    """The main process"""
    q = mp.Queue()

    workr = mp.Process(target=worker, args=(running_flag, q))
    wdog = mp.Process(target=watchdog, args=(q,))

    # run the watchdog as daemon so it terminates with the main process
    wdog.daemon = True

    workr.start()
    print("[MAIN]: starting process P1")
    wdog.start()

    # Poll the queue
    while True:
        msg = q.get()
        # if msg == "KILL WATCHDOG":
        if msg == "KILL WORKER":
            print("[MAIN]: Terminating slacking WORKER")
            workr.terminate()
            time.sleep(0.1)
            if not workr.is_alive():
                print("[MAIN]: WORKER is a goner")
                workr.join(timeout=1.0)
                print("[MAIN]: Joined WORKER successfully!")
                q.close()
                break    # watchdog process daemon gets terminated

if __name__ == '__main__':
    main()
Just put your own file name in place of train_model_parallel:

kill -9 `ps -ef | grep train_model_parallel.py | grep -v grep | awk '{print $2}'`

The ps output is filtered for the script name (grep -v grep drops the grep process itself from the list), awk extracts the PID column, and kill -9 force-kills those PIDs.