
Concurrent non-overlapping pwrite() to a file mounted on NFS using multiple MPI processes

I have a computational fluid dynamics code for which I am writing a parallel read and write implementation. What I want to achieve is for multiple MPI processes to open the same file and write data to it (there is no overlap of data; I use pwrite() with offset information). This seems to work fine when the MPI processes are on the same compute node. However, when I use 2 or more compute nodes, some of the data never reaches the hard drive. To demonstrate this, I have written the following C program, which I compile with mpicc (my MPI distribution is MPICH):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

long _numbering(long i,long j,long k, long N) {
  return (((i-1)*N+(j-1))*N+(k-1));
}

int main(int argc, char **argv)
{
  int   numranks, rank,fd,dd;
  long i,j,k,offset,N;
  double value=1.0;
  MPI_Init(NULL,NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &numranks);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  N=10;
  offset=rank*N*N*N*sizeof(double);
  fd=-1;
  printf("Opening file datasetparallel.dat\n");
  //while(fd==-1) {fd = open("datasetparallel.dat", O_RDWR | O_CREAT | O_SYNC,0666);}
  while(fd==-1) {fd = open("datasetparallel.dat", O_RDWR | O_CREAT,0666);}
  //while(dd==-1) {dd = open("/homeA/Desktop/", O_RDWR ,0666);} /* directory fd, for fsync(dd) */

  for(i=1;i<=N;i++) {
    for(j=1;j<=N;j++) {
      for(k=1;k<=N;k++) {
        if(pwrite(fd,&value,sizeof(double),_numbering(i,j,k,N)*sizeof(double)+offset)!=sizeof(double)) perror("datasetparallel.dat");
        //pwrite(fd,&value,sizeof(double),_numbering(i,j,k,N)*sizeof(double)+offset);
        value=value+1.0;
      }
    }
  }
  //if(close(fd)==-1) perror("datasetparallel.dat");
  fsync(fd); //fsync(dd);
  close(fd); //close(dd);
 
  printf("Done writing in parallel\n");
  if(rank==0) {
    printf("Beginning serial write\n");
    int ranknum;
    fd=-1;
    value=1.0;
    while(fd==-1) {fd = open("datasetserial.dat", O_RDWR | O_CREAT,0666);}
    for(ranknum=0;ranknum<numranks;ranknum++){
      offset=ranknum*N*N*N*sizeof(double); printf("Offset for rank %d is %ld\n",ranknum,offset);
      printf("writing for rank=%d\n",ranknum);
      for(i=1;i<=N;i++) {
        for(j=1;j<=N;j++) {
          for(k=1;k<=N;k++) {
            if(pwrite(fd,&value,sizeof(double),_numbering(i,j,k,N)*sizeof(double)+offset)!=sizeof(double)) perror("datasetserial.dat");
            //pwrite(fd,&value,sizeof(double),_numbering(i,j,k,N)*sizeof(double)+offset);
            value=value+1.0;
          }
        }
      }
      value=1.0;
    }
    //if(close(fd)==-1) perror("datasetserial.dat");
    fsync(fd);
    close(fd);
    printf("Done writing in serial\n");
  }
  MPI_Finalize();
  return 0;
}

The above program writes doubles in ascending order to a file. Each MPI process writes the same numbers (1.0 to 1000.0), but to different regions of the file. For example, rank 0 writes 1.0 to 1000.0, and rank 1 writes 1.0 to 1000.0 starting just after the location where rank 0 wrote 1000.0. The program produces a file named datasetparallel.dat, written through concurrent pwrite() calls. It also produces datasetserial.dat as a reference to compare against datasetparallel.dat and check its integrity (I do this with the cmp command in the terminal). When cmp reports a discrepancy, I inspect the contents of the files with the od command:

od -N <byte_number> -tfD <file_name>

For example, I found some missing data (holes in the file) using the above program. In the file written in parallel, the od output shows:

.
.
.
0007660                      503                      504
0007700                      505                      506
0007720                      507                      508
0007740                      509                      510
0007760                      511                      512
0010000                        0                        0
*
0010620                        0
0010624

while in the reference file written serially, the od output shows:

.
.
.
0007760                      511                      512
0010000                      513                      514
0010020                      515                      516
0010040                      517                      518
0010060                      519                      520
0010100                      521                      522
0010120                      523                      524
0010140                      525                      526
0010160                      527                      528
0010200                      529                      530
0010220                      531                      532
0010240                      533                      534
0010260                      535                      536
0010300                      537                      538
0010320                      539                      540
0010340                      541                      542
0010360                      543                      544
0010400                      545                      546
0010420                      547                      548
0010440                      549                      550
0010460                      551                      552
0010500                      553                      554
0010520                      555                      556
0010540                      557                      558
0010560                      559                      560
0010600                      561                      562
0010620                      563                      564
.
.
.

So far, the only way to fix this seems to be to open the file with the POSIX O_SYNC flag, which ensures the data is physically written to the hard drive, but this is impractically slow. Another equally slow approach seems to be using the built-in MPI I/O routines, and I am not sure why MPI I/O is slow either. The storage is exported over NFS with the following options: rw,nohide,insecure,no_subtree_check,sync,no_wdelay. I have tried calling fsync() on both the file and its directory, to no avail. So I need advice on how to fix this.
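
A minimal sketch of that MPI I/O variant (reusing _numbering(), N, rank, offset and value from the program above; this is an illustration, not exactly timed code) would look something like this:

MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "datasetparallel.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
for(i=1;i<=N;i++) {
  for(j=1;j<=N;j++) {
    for(k=1;k<=N;k++) {
      /* independent write, one double at a time, same offsets as the pwrite() loop */
      MPI_File_write_at(fh, _numbering(i,j,k,N)*sizeof(double)+offset,
                        &value, 1, MPI_DOUBLE, MPI_STATUS_IGNORE);
      value=value+1.0;
    }
  }
}
MPI_File_close(&fh); /* closing also flushes the file */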

NFS is a horrible file system. As you have seen, its caching behavior makes it trivially easy for processes to "false share" a cached block and then corrupt data.

If you are stuck with NFS, do the compute in parallel but then do all the I/O from one rank.
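
If it helps, a minimal sketch of that pattern might look like the following. It assumes each rank already holds its N*N*N doubles contiguously in a local buffer; the buffer names and the output file name are made up for illustration:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
  int rank, numranks, fd;
  long N = 10, blk, i;
  ssize_t want;
  double *local, *global = NULL;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numranks);

  blk = N * N * N;
  local = malloc(blk * sizeof(double));
  for (i = 0; i < blk; i++) local[i] = (double)(i + 1);   /* same 1.0..1000.0 pattern */

  if (rank == 0) global = malloc((size_t)numranks * blk * sizeof(double));

  /* collect every rank's block on rank 0; blocks land in rank order,
     matching the offset = rank*N*N*N*sizeof(double) layout above      */
  MPI_Gather(local, (int)blk, MPI_DOUBLE,
             global, (int)blk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    fd = open("datasetgathered.dat", O_WRONLY | O_CREAT, 0666);
    if (fd == -1) { perror("datasetgathered.dat"); MPI_Abort(MPI_COMM_WORLD, 1); }
    want = (ssize_t)numranks * blk * sizeof(double);
    if (pwrite(fd, global, want, 0) != want) perror("datasetgathered.dat");
    fsync(fd);
    close(fd);
    free(global);
  }

  free(local);
  MPI_Finalize();
  return 0;
}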

A true parallel file system like OrangeFS/PVFS ( http://www.orangefs.org ) will help immensely here, especially if you start using MPI-IO (you are already using MPI, so you're halfway there). Lustre is another option; OrangeFS is the simpler of the two to configure, but maybe I am biased since I used to work on it.

It's absolutely possible to address scattered memory regions in collective I/O. All your data is MPI_DOUBLE, so all you need to do is describe the regions (with, at worst, MPI_TYPE_CREATE_HINDEXED) and provide the addresses. You'll see a huge increase in performance, if for no other reason than that you will be issuing one MPI-IO call instead of (if N == 10) 1000. Your data is contiguous in the file, so you don't even have to worry about file views.
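
A rough sketch of that idea follows; the number of in-memory pieces, their lengths and the pieces[] array are hypothetical placeholders for however the real CFD arrays are laid out, and rank and N are as in the question:

/* pieces[] is a hypothetical set of 4 separately allocated arrays that
   together hold this rank's N*N*N doubles                              */
int          nblocks = 4;
int          blocklens[4];
MPI_Aint     displs[4];
MPI_Datatype memtype;
MPI_File     fh;
double      *pieces[4];
int          b;

for (b = 0; b < nblocks; b++) {
  pieces[b]    = malloc(250 * sizeof(double));     /* 4 * 250 doubles = N*N*N, fill with data */
  blocklens[b] = 250;                              /* length of this piece, in doubles        */
  MPI_Get_address(pieces[b], &displs[b]);          /* absolute address of each in-memory piece */
}
MPI_Type_create_hindexed(nblocks, blocklens, displs, MPI_DOUBLE, &memtype);
MPI_Type_commit(&memtype);

MPI_File_open(MPI_COMM_WORLD, "datasetparallel.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

/* the datatype carries absolute addresses, so pass MPI_BOTTOM as the buffer;
   one collective call per rank replaces the N*N*N individual pwrite()s       */
MPI_File_write_at_all(fh, (MPI_Offset)rank*N*N*N*sizeof(double),
                      MPI_BOTTOM, 1, memtype, MPI_STATUS_IGNORE);

MPI_File_close(&fh);
MPI_Type_free(&memtype);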

Furthermore, remember how I said "do all your I/O from one process"? This is a little more advanced, but if you set the "cb_nodes" hint (how many nodes to use for the "collective buffering" optimization) to 1, MPI-IO will do just that for you.
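
Setting that hint might look something like this (reusing the file name from the question):

MPI_Info info;
MPI_File fh;

MPI_Info_create(&info);
MPI_Info_set(info, "cb_nodes", "1");   /* one aggregator node does all the writing */

MPI_File_open(MPI_COMM_WORLD, "datasetparallel.dat",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
/* ... collective writes as in the sketch above ... */
MPI_File_close(&fh);
MPI_Info_free(&info);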
