
MVAPICH deadlocks on CUDA memory while kernel is running

I am trying to get an MPI-CUDA program working with MVAPICH and CUDA 8. I previously ran the program successfully with OpenMPI, but I want to test whether MVAPICH gives better performance. Unfortunately, with MVAPICH the program gets stuck in MPI_Isend if a CUDA kernel is running at the same time.

I downloaded MVAPICH2-2.2 and built it from source with the configuration flags

--enable-cuda --disable-mcast

to enable MPI calls on CUDA memory. mcast was disabled because the build would not compile without that flag.
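For reference, the build went roughly like this (the install prefix is a placeholder for my local setup):

./configure --prefix=$HOME/mvapich2-2.2-cuda --enable-cuda --disable-mcast
make -j8
make install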

I set the following environment variables before running the application:

export MV2_USE_CUDA=1
export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/gdrcopy/
export MV2_USE_GPUDIRECT=1

MPI_Isend/recv work fine as long as no CUDA kernel is running at the same time. But in my program it is important that MPI sends and receives data to and from GPU memory while a kernel is running.

I came up with two possible explanations for this behavior. First, MVAPICH for some reason launches its own CUDA kernel to send the data from GPU memory, and that kernel never gets scheduled because the GPU is already fully occupied. Second, MVAPICH uses a synchronous cudaMemcpy somewhere (not the async version), which blocks until the kernel finishes execution.
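The second hypothesis can be tested in isolation, without MPI: while a kernel spins in its own stream, a plain cudaMemcpy (which is issued on the legacy default stream and therefore waits for work in other blocking streams) never returns, whereas cudaMemcpyAsync on a separate stream completes and releases the kernel. A minimal sketch of that experiment (kernel and variable names are made up for illustration):

#include <stdio.h>
#include <cuda_runtime.h>

// Spins until the host writes a nonzero value into the device buffer.
__global__ void spin_until_set(volatile int *flag)
{
    while (*flag == 0) { }
}

int main()
{
    int *d_flag;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));

    cudaStream_t kernel_stream, copy_stream;
    cudaStreamCreate(&kernel_stream);
    cudaStreamCreate(&copy_stream);

    spin_until_set<<<1, 1, 0, kernel_stream>>>(d_flag);

    int one = 1;
    // Completes while the kernel is still spinning, because copy_stream
    // does not synchronize with kernel_stream.
    cudaMemcpyAsync(d_flag, &one, sizeof(int), cudaMemcpyHostToDevice, copy_stream);
    cudaStreamSynchronize(copy_stream);

    // The following call would hang instead: a synchronous cudaMemcpy runs on
    // the legacy default stream, which waits for the spinning kernel first.
    // cudaMemcpy(d_flag, &one, sizeof(int), cudaMemcpyHostToDevice);

    cudaStreamSynchronize(kernel_stream);
    printf("kernel released\n");
    cudaFree(d_flag);
    return 0;
}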

Could someone confirm one of these assumptions? And is there a flag in MVAPICH, which I am not aware of, that solves this problem?

EDIT:

Here is a "simple" piece of code that illustrates my problem. When executed with OpenMPI, it runs and terminates correctly. With MVAPICH2 it deadlocks at the marked MPI_Send call.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <mpi.h>


// Each rank spins on its device buffer until the value written by the other rank arrives.
__global__ void kernel(double * buffer, int rank)
{
    volatile double *buf = buffer;
    if(rank == 0){
        while(buf[0] != 3){}
    } else {
        while(buf[0] != 2){}
    }
}


int main(int argc, char **argv)
{
    double host_buffer[1];
    MPI_Init(&argc, &argv);
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    printf("Im rank %d\n", world_rank);
    cudaSetDevice(world_rank);

    double * dev_buffer;
    cudaError_t err = cudaMalloc(&dev_buffer, sizeof(double));
    if(world_rank == 0){
        host_buffer[0] = 1;
        cudaError_t err = cudaMemcpy(dev_buffer, host_buffer, sizeof(double), cudaMemcpyHostToDevice);
        MPI_Send(dev_buffer, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        printf("[%d]First send does not deadlock\n", world_rank);
    }else {
        MPI_Recv(dev_buffer, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("[%d]Received first message\n", world_rank);
    }

    cudaStream_t stream, kernel_stream;
    cudaStreamCreate(&stream);
    cudaStreamCreate(&kernel_stream);

    printf("[%d]launching kernel\n", world_rank);
    kernel<<<208, 128, 0, kernel_stream>>>(dev_buffer, world_rank);

    if(world_rank == 0){
        //rank 0
        host_buffer[0] = 2;
        cudaMemcpyAsync(
            dev_buffer, host_buffer, sizeof(double),
            cudaMemcpyHostToDevice,
            stream
        );
        cudaStreamSynchronize(stream);

        printf("[%d]Send message\n", world_rank);
        MPI_Send(dev_buffer, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD); //mvapich2 deadlocks here
        printf("[%d]Message sent\n", world_rank);

        printf("[%d]Receive message\n", world_rank);
        MPI_Recv(dev_buffer, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("[%d]Message received\n", world_rank);

        cudaStreamSynchronize(kernel_stream);
        printf("[%d]kernel finished\n", world_rank);

    } else {
        //rank 1
        printf("[%d]Receive message\n", world_rank);
        MPI_Recv(dev_buffer, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("[%d]Message received\n", world_rank);

        cudaStreamSynchronize(kernel_stream);
        printf("[%d]kernel finished\n", world_rank);

        host_buffer[0] = 3;
        cudaMemcpyAsync(
            dev_buffer, host_buffer, sizeof(double),
            cudaMemcpyHostToDevice,
            stream
        );
        cudaStreamSynchronize(stream);

        printf("[%d]Send message\n", world_rank);
        MPI_Send(dev_buffer, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        printf("[%d]Message sent\n", world_rank);

    }
    printf("[%d]Stopped execution\n", world_rank);
    MPI_Finalize();
}
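For completeness, I built and launched the test roughly as follows (paths, hostnames and the binary name are placeholders; depending on the MVAPICH2 installation the include/library paths and library name may differ):

nvcc deadlock_test.cu -o deadlock_test -I$MVAPICH2_HOME/include -L$MVAPICH2_HOME/lib -lmpi
mpirun_rsh -np 2 node0 node1 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT=1 ./deadlock_test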

I got back to this problem and used gdb to debug the code.

Apparently, the problem is the eager protocol of MVAPICH2, implemented in src/mpid/ch3/channels/mrail/src/gen2/ibv_send.c. The eager protocol uses a synchronous cudaMemcpy, which blocks until the kernel execution finishes.

The program posted in the question runs fine when passing MV2_IBA_EAGER_THRESHOLD=1 to mpirun. This prevents MPI from using the eager protocol and makes it use the rendezvous protocol instead.
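For example, with mpirun_rsh the variable can be passed directly on the command line (hostnames and binary name are placeholders):

mpirun_rsh -np 2 node0 node1 MV2_USE_CUDA=1 MV2_IBA_EAGER_THRESHOLD=1 ./deadlock_test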

Patching the MVAPICH2 source code also solves the problem. I changed the synchronous cudaMemcpy calls to cudaMemcpyAsync in the files

  • src/mpid/ch3/channels/mrail/src/gen2/ibv_send.c
  • src/mpid/ch3/channels/mrail/src/gen2/ibv_recv.c
  • src/mpid/ch3/src/ch3u_request.c

The change in the third file is only needed for MPI_Isend/MPI_Irecv. Other MPI functions might need some additional code changes.
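Conceptually, each change follows the same pattern; the sketch below only illustrates the idea with made-up variable names and is not the actual MVAPICH2 source:

// Before: blocking copy, issued on the legacy default stream, so it also
// waits for the user's kernel that is still running in another stream.
// cudaMemcpy(host_vbuf, device_buf, nbytes, cudaMemcpyDeviceToHost);

// After: asynchronous copy on a dedicated stream; waiting on that stream
// only waits for the copy itself, not for the unrelated running kernel.
cudaMemcpyAsync(host_vbuf, device_buf, nbytes, cudaMemcpyDeviceToHost, copy_stream);
cudaStreamSynchronize(copy_stream);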
