
Can this be the most atomic "cancel if not received" in multithreaded MPI_Irecv?

The present question is set in a multithreaded program in which several (e.g. 5) threads are working after each has started listening with MPI_Irecv, using MPI_ANY_SOURCE as the source. Before exiting the function, each thread should check whether a message was received, or else cancel the request to free up the memory.

The assumption here is that the message arrives at only one of the N (e.g. 5) threads. The problem is what happens if a message arrives in the time between (1) checking whether a message has arrived and (2) cancelling the request because that check returned false.

As a side note, using a single receiver that writes to an atomically accessed queue should solve it. But that implies major code refactoring, and maybe a performance decrease.
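For illustration only, the queue side of such a single-receiver design might look like the sketch below. It assumes a mutex-protected std::queue rather than a lock-free structure, and the names Request, RequestQueue, push and try_pop are hypothetical, not taken from the program in question. The receiver thread (not shown) would be the only one calling MPI receive functions and would call push after each completed receive.

```cpp
#include <mutex>
#include <queue>

// Hypothetical request record: the received value and the sending rank.
struct Request {
    double value;
    int source;
};

// Mutex-protected FIFO shared between one receiver thread and N workers.
class RequestQueue {
public:
    // Called by the single receiver thread after each completed receive.
    void push(const Request& r) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push(r);
    }

    // Called by worker threads. Returns false when nothing is pending,
    // otherwise removes the oldest request and copies it into `out`.
    bool try_pop(Request& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<Request> items_;
};
```

With this layout no worker ever posts or cancels a receive, so the check-then-cancel window described above does not exist; the cost is the extra thread and the lock on every access.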

The question is whether the MPI standard provides an answer to this problem and, if so, what it is; or else whether the following (pseudo)code is indeed sufficient protection.

The proposed solution seems suspicious, as the logs (see below) only show the combination "irecv not capturing messages + failure to cancel the related request". There seems to be no memory build-up, though.

In main.cpp:

//...
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) {
    error_report("[error] The MPI did not provide the requested threading behaviour.");
}
//...

In the relevant function:

// Start receiving
MPI_Irecv(&buffer, 1, MPI_DOUBLE,
                      MPI_ANY_SOURCE,
                      VERTEXVAL_REQUEST_FLAG,
                      MPI_COMM_WORLD,
                      &R);

// some work goes on here ... 

// Before exiting, we check if a message arrived. 

int flag1=-437, flag2=-437; // any initialization

MPI_Status status1, status2;
status2.MPI_ERROR = -999; // again, any initialization
status1.MPI_ERROR = -999;
MPI_Test(&R, &flag1, &status1);

if (flag1 != 1){
    MPI_Cancel(&R);
    MPI_Test_cancelled(&status2, &flag2);
}
if ((flag1 == 1) || ((flag1!=1) && (flag2!=1))) {

    if (flag1 == 1) {
        build_answer(answer, REF, buffer, status1.MPI_SOURCE, MYPROC);
        printf("A request failed to be cancelled, we are assuming we recieved it! we computed val = %f, recieved buffer = %f ; flags12 = %d %d ;  source = %d ; tag = %d; error = %d\n",
           answer, buffer, flag1, flag2, status1.MPI_SOURCE, status1.MPI_TAG, status1.MPI_ERROR);
        std::cout << std::flush;

        MPI_Ssend(&answer, 1, MPI_DOUBLE, status1.MPI_SOURCE, (int) buffer, MPI_COMM_WORLD);

        printf("Completed!\n");
        std::cout << std::flush;

    } else {
        printf("A request failed to be cancelled: will ignore it. Recieved buffer = %f ; flags12 = %d %d ;  source = %d ; tag = %d ; status error = %d\n",
           buffer, flag1, flag2, status2.MPI_SOURCE, status2.MPI_TAG, status2.MPI_ERROR);
        std::cout << std::flush;
    }
}

This 'protection' appears to solve the 1-in-1000 deadlocks that used to arise in the program, as the previous version simply assumed that failure to cancel meant that the message had arrived. In particular, log entries show the following values printed through printf.

A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000; flags12 = 0 22020; source = 2; tag = 0; status error = -183549351
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000; flags12 = 0 0; source = 1; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000; flags12 = 0 0; source = 1; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = -0.000000; flags12 = 0 21998; source = 2; tag = 0; status error = -1563532711
A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000; flags12 = 0 0; source = 0; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000; flags12 = 0 0; source = 0; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000; flags12 = 0 0; source = 0; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000; flags12 = 0 22033; source = 2; tag = 0; status error = -691551655
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000; flags12 = 0 0; source = 0; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000; flags12 = 0 0; source = 1; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 8.000000; flags12 = 0 0; source = 0; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000; flags12 = 0 0; source = 1; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000; flags12 = 0 0; source = 1; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 8.000000; flags12 = 0 0; source = 0; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000; flags12 = 0 0; source = 1; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000; flags12 = 0 0; source = 1; tag = 25001; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = -0.000000; flags12 = 0 21998; source = 2; tag = 0; status error = -1563532711
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000; flags12 = 0 22033; source = 2; tag = 0; status error = -691551655

Look into MPI_Mprobe and MPI_Mrecv, which are designed precisely for your multithreaded scenario. It should not be necessary to cancel receives. For details, see https://www.slideshare.net/jsquyres/mpimprobe-is-good-for-you
