Python, multiprocessing and DMTCP: checkpointing one process in Pool?
Is it possible to use Python's DMTCP integration to checkpoint a child process during parallel execution?
My situation is as follows: I have a multiprocessing.Pool with several workers receiving async jobs (submitted with apply_async). Certain big jobs require all the resources (CPU cores and memory). When one of these jobs is accepted, I'd like to checkpoint all pending processes, kick them out of execution, launch the big job, and finally resume the checkpointed processes.
If you start your Python program with dmtcp_launch python ... or dmtcp_launch ./myapp.py, all child processes created by the main process are automatically under checkpoint control. Thus, when you checkpoint the computation from within your main process, all other processes are checkpointed as well.
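To make the "everything is checkpointed together" behaviour concrete, here is a minimal sketch of the two command lines involved, built from Python. The port number and the helper names (launch_cmd, checkpoint_cmd, request_checkpoint) are assumptions made for illustration, and the flags are the ones I recall from DMTCP 2.x, so check them against dmtcp_launch --help and dmtcp_command --help on your installation.

```python
import subprocess

# Assumption: the coordinator listens on DMTCP's usual default port.
COORD_PORT = 7779


def launch_cmd(app_argv, port=COORD_PORT):
    """Build the command that runs app_argv, and every child it forks, under DMTCP."""
    return ["dmtcp_launch", "--coord-port", str(port), *app_argv]


def checkpoint_cmd(port=COORD_PORT):
    """Build the command asking the coordinator to checkpoint all attached processes."""
    return ["dmtcp_command", "--coord-port", str(port), "--checkpoint"]


def request_checkpoint(port=COORD_PORT):
    """Send the checkpoint request (needs DMTCP installed and a running computation)."""
    subprocess.run(checkpoint_cmd(port), check=True)


# Show the command lines without executing them.
print(" ".join(launch_cmd(["python", "myapp.py"])))
print(" ".join(checkpoint_cmd()))
```

Because the Pool workers are forked by the launched program, they attach to the same coordinator, so a single checkpoint request covers the scheduler and all workers at once.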
I am not familiar enough with multiprocessing.Pool to comment in detail on that front, but from a quick look, it seems you don't want to checkpoint your main process (the scheduler). However, DMTCP will checkpoint and restart the entire computation (including the scheduler) as a single unit. Is that acceptable? If not, the alternative is to not launch the scheduler under DMTCP control, but to modify it so that it launches only the child/slave processes under checkpoint control. I am not sure whether that's something you can do in your application.
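One way the alternative could look, as a rough sketch rather than a tested recipe: the scheduler runs as a plain Python process and starts each job as its own subprocess under dmtcp_launch, each with a private coordinator, so a job can be checkpointed, killed, and later restarted without touching the scheduler. This replaces multiprocessing.Pool with subprocess, which is my assumption about how the scheduler might be modified. All function names, the port numbers, and the checkpoint directory layout are hypothetical; the flags are the ones I recall from DMTCP 2.x and should be verified with --help.

```python
import glob
import os
import subprocess


def worker_launch_cmd(job_argv, port, ckpt_dir):
    """Command that starts one job under its own DMTCP coordinator."""
    return ["dmtcp_launch", "--new-coordinator", "--coord-port", str(port),
            "--ckptdir", ckpt_dir, *job_argv]


def suspend_cmds(port):
    """Commands that checkpoint a job (blocking until done) and then kill it."""
    base = ["dmtcp_command", "--coord-port", str(port)]
    return [base + ["--bcheckpoint"], base + ["--kill"]]


def resume_cmd(port, ckpt_files):
    """Command that restarts a job from its checkpoint images."""
    # Whether a coordinator must already be running on this port depends on
    # the DMTCP version and setup; this sketch leaves that to the caller.
    return ["dmtcp_restart", "--coord-port", str(port), *ckpt_files]


def start_worker(job_argv, port, ckpt_dir):
    """Launch the job; the scheduler itself stays outside DMTCP."""
    os.makedirs(ckpt_dir, exist_ok=True)
    return subprocess.Popen(worker_launch_cmd(job_argv, port, ckpt_dir))


def suspend_worker(port):
    """Free the job's CPU and memory so a big job can run."""
    for cmd in suspend_cmds(port):
        subprocess.run(cmd, check=True)


def resume_worker(port, ckpt_dir):
    """Restart the job from the images DMTCP wrote into ckpt_dir."""
    images = sorted(glob.glob(os.path.join(ckpt_dir, "ckpt_*.dmtcp")))
    if not images:
        raise FileNotFoundError("no checkpoint images in " + ckpt_dir)
    return subprocess.Popen(resume_cmd(port, images))
```

A scheduler built this way would call suspend_worker for each running job when a big job arrives, run the big job, and then call resume_worker for each suspended job. Using one coordinator per job keeps the checkpoints independent, at the cost of managing one port per job.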