How to terminate with Signal with MPI

This is the first program I have written with MPI in C. It is supposed to terminate within 15 seconds, but it does not; it never even enters the if (end_now == 1) branch. Does anyone know what is happening here? The code, simplified, is below:

#include <mpi.h>
#include <signal.h>
#include <stdio.h>

// current_number and current_total are globals omitted from this simplified listing
int end_now = 0;

void sig_handler(int signo)
{
    if (signo == SIGUSR1) {
        end_now = 1;
        printf("  %8d  %8d\n", current_number, current_total);
    }
}

int main(int argc, char **argv)
{
    int id;
    int count;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &count);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    signal(SIGUSR1, sig_handler);

    while (1) {
        // MPI_Allreduce is called here to sum up the subtotals calculated by the child processes
        if (end_now == 1) {
            printf("here\n");  // this "here" is never printed
            break;
        }
    }
    MPI_Finalize();
    return 0;
}

I am running the code on my Mac with timeout --signal=USR1 15 mpirun.openmpi -np 2 ./a.out. Thanks to whoever may help.

You're sending the signal to mpirun, not to your executable. Since mpirun itself doesn't have a SIGUSR1 handler, nothing happens.

BTW, using signals with MPI programs is not something you want to do. MPI programs rely on multiple invocations running in lockstep, which doesn't match the asynchronous, per-process nature of signals.

Sneftel is right. Gilles Gouaillardet is also VERY right. I want to add some other info.

Even if you send the signal to the actual program and not to mpirun, you may be sending it to only ONE of your processes rather than to all of them.

Yes, signals are not the right tool in MPI programs. But if you want to use them anyway, you should first find out whether your processes receive the signal at all, and which of them do.

Insert a printf directly into the signal handler. Print something like "MPI process number %d got the signal", passing the rank obtained from MPI_Comm_rank to this printf. (UPD 2018-04-27 7:31 MSK: sorry, I didn't notice you already have such a printf in your code.) (Note: I think printf in MPI programs is allowed in the first process only, and using it in other processes is probably a bad idea, but it will do for debugging. Also, I think calling printf directly from a signal handler is a bad idea, but, again, it will do for debugging.)

This will tell you whether your processes get the signal, and which of them.

If you aren't satisfied with the results, try a different program instead of gtimeout. For example, timeout from GNU Coreutils. (Well, this is a Mac, and I'm not sure GNU Coreutils is available for Mac, but I still think you can find SOME timeout.)

Also, you didn't describe your setup in the question. Do your MPI programs run on different hosts or on one? Are the MPI "programs" really implemented as separate processes, or as threads? Which MPI implementation do you use, and which version? If you don't know how MPI starts your processes, at least tell us how you installed your MPI implementation and how you configured it.

You can even do without any timeout or gtimeout at all. Just type this in one console:

sh -c 'echo $$ > ~/pid-of-mpirun; exec ~/opt/usr/local/bin/mpirun -np 2 ./a.out'

This runs mpirun while storing its PID in ~/pid-of-mpirun. Then run this in another terminal (of course, you don't need to start it at exactly the same moment):

sleep 15; kill -USR1 $(cat ~/pid-of-mpirun)

This will wait 15 seconds and then send USR1 to the process whose PID is in ~/pid-of-mpirun.

But all this will probably send USR1 to mpirun and not to the actual processes (I am not sure, test this!). How do you send it to the actual processes? Well, you can read the manual page for kill and work out how to send a signal to a whole process group instead of to just one process.

Also, you can write the PID into a file directly from inside your C program.

Example:

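#include <stdio.h>
#include <unistd.h> // macOS is a UNIX system, so unistd.h is available
// ...
// fopen() does not expand "~", so this path is relative to the working directory
FILE *fout = fopen("my-pid", "w");
if (fout != NULL) { fprintf(fout, "%d\n", (int)getpid()); fclose(fout); }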

Of course, you should make sure that different processes create different files. For example, generate the file names from the rank returned by MPI_Comm_rank.
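A minimal sketch of per-rank file names follows. The helper write_pid_file and the file name pattern prefix.rank are assumptions made for illustration; rank is expected to be the value returned by MPI_Comm_rank.

```c
#include <stdio.h>
#include <unistd.h>

/* Writes this process's PID to the file "<prefix>.<rank>".
 * Returns 0 on success, -1 if the name is too long or the file cannot be written. */
int write_pid_file(const char *prefix, int rank)
{
    char path[256];
    int n = snprintf(path, sizeof path, "%s.%d", prefix, rank);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;               /* formatting failed or name was truncated */

    FILE *fout = fopen(path, "w");
    if (fout == NULL)
        return -1;

    fprintf(fout, "%ld\n", (long)getpid());
    return fclose(fout) == 0 ? 0 : -1;
}
```

Calling write_pid_file("my-pid", id) right after MPI_Comm_rank would give my-pid.0, my-pid.1 and so on, one file per process.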

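end_now should also be declared volatile; otherwise the compiler may optimize the main loop into one that runs forever.

I would advise against using printf in a signal handler: printf is not a reentrant function, and on some platforms this can crash the program.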
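Both remarks can be combined in one sketch: the flag becomes volatile sig_atomic_t, and the debugging message is formatted ahead of time so that the handler only calls write, which is async-signal-safe. The helper install_sigusr1_handler and the buffer names are hypothetical; rank is assumed to come from MPI_Comm_rank.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t end_now = 0;  /* safe to set from a handler */
static char sig_msg[64];                   /* formatted once, outside the handler */
static size_t sig_msg_len = 0;

static void sig_handler(int signo)
{
    if (signo == SIGUSR1) {
        int saved_errno = errno;           /* write() may overwrite errno */
        end_now = 1;
        ssize_t ignored = write(STDERR_FILENO, sig_msg, sig_msg_len);
        (void)ignored;
        errno = saved_errno;
    }
}

/* Prepares the per-rank message and installs the handler.
 * Returns 0 on success, -1 on failure. */
int install_sigusr1_handler(int rank)
{
    int n = snprintf(sig_msg, sizeof sig_msg,
                     "MPI process number %d got the signal\n", rank);
    if (n < 0 || (size_t)n >= sizeof sig_msg)
        return -1;
    sig_msg_len = (size_t)n;

    return signal(SIGUSR1, sig_handler) == SIG_ERR ? -1 : 0;
}
```

The main loop can then keep testing end_now as before; the compiler has to reload a volatile flag on every iteration.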
