
Performance comparison MPI vs OpenMP

I have a very strange problem. I do not even know if I can provide you all the information you need to answer my question; in case something is missing, please let me know.

I run code like this using MPI:

#include <mpi.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>
#include "mkl.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);  // MPI_Wtime() may only be called after MPI_Init()

    // Example sizes: the original snippet does not define M, N, K, n.
    const int M = 1000, N = 1000, K = 100, n = N;

    double *gradient_D = new double[K*M];
    double *DX = new double[M*N];
    std::fill(DX, DX + M*N, 1.0);  // initialize DX so the sums are well defined

    double gradientD_time = MPI_Wtime();
    for (int j = 0; j < K; j++){
        for (int i = 0; i < M; i++){
            gradient_D[j*M+i] = 0;
            for (int k = 0; k < n; k++)
                gradient_D[i+M*j] += DX[i+k*M];
        }
    }
    double gradientD_total_time = (MPI_Wtime() - gradientD_time);
    printf("Gradient D total = %f \n", gradientD_total_time);

    delete[] gradient_D;
    delete[] DX;
    MPI_Finalize();
    return 0;
}

The exact meaning of the code does not really matter: I am just running three for loops and measuring the elapsed time. In the CMakeLists.txt I wrote the following commands:

project(mpi_algo)
cmake_minimum_required(VERSION 2.8)
set(CMAKE_CXX_COMPILER "mpicxx")
set(CMAKE_SHARED_LIBRARY_LINK_CXX_FLAGS)
set(CMAKE_CXX_FLAGS "-cxx=icpc -mkl=sequential")
add_executable(mpi_algo main.cpp)

and I run the code:

mpirun -np 1 ./mpi_algo

After that, I run a similar code in which I do the same operations, but using OpenMP instead of MPI:

#include <omp.h>
#include <cmath>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <algorithm>
#include "mkl.h"

int main()
{
    // Example sizes: the original snippet does not define M, N, K, n.
    const int M = 1000, N = 1000, K = 100, n = N;

    double *gradient_D = new double[K*M];
    double *DX = new double[M*N];
    std::fill(DX, DX + M*N, 1.0);  // initialize DX so the sums are well defined

    double gradientD_time = omp_get_wtime();
    for (int j = 0; j < K; j++){
        for (int i = 0; i < M; i++){
            gradient_D[j*M+i] = 0;
            for (int k = 0; k < n; k++)
                gradient_D[i+M*j] += DX[i+k*M];
        }
    }
    double gradientD_total_time = (omp_get_wtime() - gradientD_time);
    printf("Gradient D total = %f \n", gradientD_total_time);

    delete[] gradient_D;
    delete[] DX;
    return 0;
}

You can see that there are only small differences between the two codes. This is the CMakeLists.txt:

project(openmp_algo)
cmake_minimum_required(VERSION 2.8)
set(CMAKE_CXX_COMPILER "icc")
set(CMAKE_SHARED_LIBRARY_LINK_CXX_FLAGS)
set(CMAKE_CXX_FLAGS "-qopenmp -mkl=sequential")
add_executable(openmp_algo main.cpp)

and I run the code:

./openmp_algo

Now, what I cannot explain is that the MPI code takes about 1 second to run, while the other one, which should be doing the same thing, takes about 20 seconds.

Could someone please explain to me why?

EDIT: the constants M, N, n, k do not matter for understanding the issue. They just define the dimensions of the arrays.

Since you don't give many details about the environment, I will make a wild guess to try to give an explanation. First, let's make a few remarks:

  • Your two seemingly identical codes effectively do nothing with their results, so a clever compiler is fully entitled to optimize away your compute loops and just do the printing (see the sketch after this list for a way to prevent that);
  • Your OpenMP code is compiled with a vanilla icc (an odd choice for C++ code, by the way), so its optimization level will be the default -O2 (minus the extra optimizations that are not considered thread-safe by default and that using -qopenmp will disable);
  • Your MPI code is compiled with a plain mpicxx, which will internally call icpc as the compiler.
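
To illustrate the first point, a minimal guard (just a sketch, reusing the question's variable names) is to consume the loop's result after the timed region, so the compiler can no longer prove the work is dead:

// Place this after the timed loop, in either version.
// Summing and printing the output makes gradient_D observable,
// so dead-code elimination can no longer legally remove the loops.
double checksum = 0.0;
for (int j = 0; j < K; j++)
    for (int i = 0; i < M; i++)
        checksum += gradient_D[j*M+i];
printf("checksum = %f \n", checksum);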

This mpicxx is what I suspect is the key here: indeed, mpicxx is just a wrapper around the actual compiler, which also sets some include paths, some library paths and library lists, but might set some extra optimization options as well. In some cases, for example, the optimization options used while installing the MPI library are kept inside the mpicxx wrapper and silently used by default when compiling your codes...

So here is my guess: your mpicxx sets, among other things, the -O3 optimization option, and therefore the compiler optimizes away the loop in the MPI case, while the default -O2 that you get for your OpenMP code doesn't do it. As a result, you're timing pretty much nothing in the case of your MPI code, while you're timing the actual loop execution with your OpenMP one.
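
If that is indeed the cause, an easy way to compare apples to apples is to pin the optimization level explicitly in both builds. This is just a sketch against the two CMakeLists.txt shown above (with icc/icpc, a later -O option on the command line overrides an earlier one, so an explicit flag beats whatever the wrapper injects):

# In the MPI project's CMakeLists.txt:
set(CMAKE_CXX_FLAGS "-cxx=icpc -mkl=sequential -O2")

# In the OpenMP project's CMakeLists.txt:
set(CMAKE_CXX_FLAGS "-qopenmp -mkl=sequential -O2")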

Just a guess, but that seems fair enough. A good test would be to check what mpicxx -cxx=icpc -show gives you: it prints the command line the wrapper would actually pass to the underlying compiler, so you can see whether any -O option is silently added...
