
Performance comparison MPI vs OpenMP

I have a very strange problem. I do not even know if I can provide you all the information you need to answer my question; in case something is missing, please let me know.

I run code like this using MPI:

#include <mpi.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>
#include "mkl.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);  // MPI_Wtime() may only be called after MPI_Init()

    // Example sizes: the original snippet does not define M, N, K, n.
    const int M = 1000, N = 1000, K = 100, n = N;

    double *gradient_D = new double[K*M];
    double *DX = new double[M*N];
    std::fill(DX, DX + M*N, 1.0);  // initialize DX so the sums are well defined

    double gradientD_time = MPI_Wtime();
    for (int j = 0; j < K; j++){
        for (int i = 0; i < M; i++){
            gradient_D[j*M+i] = 0;
            for (int k = 0; k < n; k++)
                gradient_D[i+M*j] += DX[i+k*M];
        }
    }
    double gradientD_total_time = (MPI_Wtime() - gradientD_time);
    printf("Gradient D total = %f \n", gradientD_total_time);

    delete[] gradient_D;
    delete[] DX;
    MPI_Finalize();
    return 0;
}

The exact meaning of the code does not really matter: I am just running three for loops and measuring the elapsed time. In the CMakeLists.txt I wrote the following commands:

project(mpi_algo)
cmake_minimum_required(VERSION 2.8)
set(CMAKE_CXX_COMPILER "mpicxx")
set(CMAKE_SHARED_LIBRARY_LINK_CXX_FLAGS)
set(CMAKE_CXX_FLAGS "-cxx=icpc -mkl=sequential")
add_executable(mpi_algo main.cpp)

and I run the code:

mpirun -np 1 ./mpi_algo

After that, I run a similar code in which I do the same operations, but using OpenMP instead of MPI:

#include <omp.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>
#include "mkl.h"

int main()
{
    // Example sizes: the original snippet does not define M, N, K, n.
    const int M = 1000, N = 1000, K = 100, n = N;

    double *gradient_D = new double[K*M];
    double *DX = new double[M*N];
    std::fill(DX, DX + M*N, 1.0);  // initialize DX so the sums are well defined

    double gradientD_time = omp_get_wtime();
    for (int j = 0; j < K; j++){
        for (int i = 0; i < M; i++){
            gradient_D[j*M+i] = 0;
            for (int k = 0; k < n; k++)
                gradient_D[i+M*j] += DX[i+k*M];
        }
    }
    double gradientD_total_time = (omp_get_wtime() - gradientD_time);
    printf("Gradient D total = %f \n", gradientD_total_time);

    delete[] gradient_D;
    delete[] DX;
    return 0;
}

You can see that there are only small differences between the two codes. This is the CMakeLists.txt:

project(openmp_algo)
cmake_minimum_required(VERSION 2.8)
set(CMAKE_CXX_COMPILER "icc")
set(CMAKE_SHARED_LIBRARY_LINK_CXX_FLAGS)
set(CMAKE_CXX_FLAGS "-qopenmp -mkl=sequential")
add_executable(openmp_algo main.cpp)

and I run the code:

./openmp_algo

Now, what I cannot explain is that the MPI code takes about 1 second to run, while the other one, which should be doing the same thing, takes about 20 seconds.

Could someone please explain to me why?

EDIT: the constants M, N, n, k do not matter for understanding the issue. They just define the dimensions of the arrays.

Since you don't give many details about the environment, I will make a wild guess to try to give an explanation. First, let's make a few remarks:

  • Your two seemingly identical codes effectively do nothing with their results, so a clever compiler is fully entitled to optimize away your compute loops and just do the printing (see the sketch after this list for a way to prevent that);
  • Your OpenMP code is compiled with a vanilla icc (an odd choice for C++ code, by the way), so its optimization level will be the default -O2 (minus the extra optimizations that are not considered thread-safe by default and that using -qopenmp will disable);
  • Your MPI code is compiled with a plain mpicxx, which will internally call icpc as the compiler.
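
To illustrate the first point, a minimal guard (just a sketch, reusing the question's variable names) is to consume the loop's result after the timed region, so the compiler can no longer prove the work is dead:

// Place this after the timed loop, in either version.
// Summing and printing the output makes gradient_D observable,
// so dead-code elimination can no longer legally remove the loops.
double checksum = 0.0;
for (int j = 0; j < K; j++)
    for (int i = 0; i < M; i++)
        checksum += gradient_D[j*M+i];
printf("checksum = %f \n", checksum);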

This mpicxx is what I suspect is the key here: indeed, mpicxx is just a wrapper around the actual compiler, which also sets some include paths, some library paths and library lists, but might set some extra optimization options as well. In some cases, for example, the optimization options used while installing the MPI library are kept inside the mpicxx wrapper and silently used by default when compiling your codes...

So here is my guess: your mpicxx sets, among other things, the -O3 optimization option, and therefore the compiler optimizes away the loop in the MPI case, while the default -O2 that you get for your OpenMP code doesn't do it. As a result, you're timing pretty much nothing in the case of your MPI code, while you're timing the actual loop execution with your OpenMP one.
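
If that is indeed the cause, an easy way to compare apples to apples is to pin the optimization level explicitly in both builds. This is just a sketch against the two CMakeLists.txt shown above (with icc/icpc, a later -O option on the command line overrides an earlier one, so an explicit flag beats whatever the wrapper injects):

# In the MPI project's CMakeLists.txt:
set(CMAKE_CXX_FLAGS "-cxx=icpc -mkl=sequential -O2")

# In the OpenMP project's CMakeLists.txt:
set(CMAKE_CXX_FLAGS "-qopenmp -mkl=sequential -O2")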

Just a guess, but that seems fair enough. A good test would be to check what mpicxx -cxx=icpc -show gives you: it prints the command line the wrapper would actually pass to the underlying compiler, so you can see whether any -O option is silently added...
