
MPI Send and Recv Hangs with Buffer Size Larger Than 64kb

I am trying to send data from process 0 to process 1. The program succeeds when the buffer is smaller than 64 KB, but hangs when the buffer is larger. The following code should reproduce the issue (it should hang) with n = 8200, which is 8200 * 8 = 65,600 bytes assuming 8-byte doubles, just over 65,536. It succeeds if n is reduced to 8000 (64,000 bytes).

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]){
  int world_size, world_rank,
      count;
  MPI_Status status;


  MPI_Init(NULL, NULL);

  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  if(world_size < 2){
    printf("Please add another process\n");
    exit(1);
  }

  int n = 8200;
  double *d = malloc(sizeof(double)*n);
  double *c = malloc(sizeof(double)*n);
  printf("malloc results %p %p\n", d, c);

  if(world_rank == 0){
    printf("sending\n");
    MPI_Send(c, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    printf("sent\n");
  }
  if(world_rank == 1){
    printf("recv\n");
    MPI_Recv(d, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);

    MPI_Get_count(&status, MPI_DOUBLE, &count);
    printf("recved, count:%d source:%d tag:%d error:%d\n", count, status.MPI_SOURCE, status.MPI_TAG, status.MPI_ERROR);
  }

  MPI_Finalize();

}

Output with n = 8200:
malloc results 0x1cb05f0 0x1cc0640
recv
malloc results 0x117d5f0 0x118d640
sending

Output with n = 8000:
malloc results 0x183c5f0 0x184c000
recv
malloc results 0x1ea75f0 0x1eb7000
sending
sent
recved, count:8000 source:0 tag:0 error:0

I found this question and this question which are similar, but I believe the issue there is with creating deadlocks. I would not expect a similar issue here because each process is performing only one send or receive.

EDIT: Added status checking.

EDIT2: It seems the issue was that I have OpenMPI installed but also installed an implementation of MPI from Intel when I installed MKL. My code was being compiled with the OpenMPI header and libraries, but run with Intel's mpirun. All works as expected when I ensure I run with the mpirun executable from OpenMPI.

The issue was with having both Intel's MPI and OpenMPI installed. I saw that /usr/include/mpi.h was owned by OpenMPI, but mpicc and mpirun were from Intel's implementation:

$ which mpicc
/opt/intel/composerxe/linux/mpi/intel64/bin/mpicc
$ which mpirun
/opt/intel/composerxe/linux/mpi/intel64/bin/mpirun

I was able to solve the issue by running

/usr/bin/mpicc

and

/usr/bin/mpirun

to ensure I used OpenMPI.

Thanks to @Zulan and @gsamaras for the suggestion to check my installation.

The code is fine! I just checked it with version 3.1.3 (as reported by mpiexec --version):

linux16:/home/users/grad1459>mpicc -std=c99 -O1 -o px px.c -lm
linux16:/home/users/grad1459>mpiexec -n 2 ./px
malloc results 0x92572e8 0x9267330
sending
sent
malloc results 0x9dc92e8 0x9dd9330
recv
recved, count:8200 source:0 tag:0 error:1839744

So the problem lies with your installation. Work through the following troubleshooting steps:

  1. Check the result of malloc *
  2. Check status

I would bet that the return value of malloc() is NULL, since you mention that it fails when you request more memory. The system might be refusing to provide that much memory.


I was partly correct: the problem was with the installation, but as the OP said:

It seems the issue was that I have OpenMPI installed but also installed an implementation of MPI from Intel when I installed MKL. My code was being compiled with the OpenMPI header and libraries, but run with Intel's mpirun. All works as expected when I ensure I run with the mpirun executable from OpenMPI.

* checking that `malloc` succeeded in C
