Performance comparison MPI vs OpenMP
I have a very strange problem. I don't even know if I can provide you all the information you need to answer my question; in case something is missing, please let me know.
I run a code like this using MPI:
#include <mpi.h>
#include <stdio.h>
#include "mkl.h"

// Example sizes; the actual values do not matter for the question (see EDIT below).
const int M = 1000, N = 1000, K = 100, n = 1000;

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double *gradient_D = new double[K*M];
    double *DX = new double[M*N];

    double gradientD_time = MPI_Wtime();
    for (int j = 0; j < K; j++){
        for (int i = 0; i < M; i++){
            gradient_D[j*M+i] = 0;
            for (int k = 0; k < n; k++)
                gradient_D[i+M*j] += DX[i+k*M];
        }
    }
    double gradientD_total_time = (MPI_Wtime() - gradientD_time);
    printf("Gradient D total = %f \n", gradientD_total_time);

    delete[] gradient_D;
    delete[] DX;
    MPI_Finalize();
    return 0;
}
The meaning of the code does not really matter: I am just running three for loops and evaluating the CPU time. In the CMake file I wrote the following commands:
project(mpi_algo)
cmake_minimum_required(VERSION 2.8)
set(CMAKE_CXX_COMPILER "mpicxx")
set(CMAKE_SHARED_LIBRARY_LINK_CXX_FLAGS)
set(CMAKE_CXX_FLAGS "-cxx=icpc -mkl=sequential")
add_executable(mpi_algo main.cpp)
and I run the code:
mpirun -np 1 ./mpi_algo
After that, I run a similar code in which I do the same operations, but using OpenMP instead of MPI:
#include <omp.h>
#include <stdio.h>
#include "mkl.h"

// Example sizes; the actual values do not matter for the question (see EDIT below).
const int M = 1000, N = 1000, K = 100, n = 1000;

int main() {
    double *gradient_D = new double[K*M];
    double *DX = new double[M*N];

    double gradientD_time = omp_get_wtime();
    for (int j = 0; j < K; j++){
        for (int i = 0; i < M; i++){
            gradient_D[j*M+i] = 0;
            for (int k = 0; k < n; k++)
                gradient_D[i+M*j] += DX[i+k*M];
        }
    }
    double gradientD_total_time = (omp_get_wtime() - gradientD_time);
    printf("Gradient D total = %f \n", gradientD_total_time);

    delete[] gradient_D;
    delete[] DX;
    return 0;
}
You can see that there are only small differences in the code. This is the CMake file:
project(openmp_algo)
cmake_minimum_required(VERSION 2.8)
set(CMAKE_CXX_COMPILER "icc")
set(CMAKE_SHARED_LIBRARY_LINK_CXX_FLAGS)
set(CMAKE_CXX_FLAGS "-qopenmp -mkl=sequential")
add_executable(openmp_algo main.cpp)
and I run the code:
./openmp_algo
Now, what I cannot explain is that the MPI code takes about 1 second to run, while the other one, which should be the same, takes about 20 seconds.
Could someone please explain why?
EDIT: the constants M, N, n, k do not matter for understanding the issue. They just define the dimensions of the arrays.
Since you don't give many details on the environment, I will make a wild guess to try to give an explanation. First, a few remarks:

- Your OpenMP code is compiled with icc (an odd choice for C++ code, BTW), so its optimization level will be the default -O2 (minus the extra optimizations seen as not thread-safe by default, which -qopenmp disables).
- Your MPI code is compiled with mpicxx, which internally calls icpc as the compiler.

It is this mpicxx that I suspect is the key here: indeed, mpicxx is just a wrapper around the actual compiler which, in addition to setting some include paths and library paths and lists, may also set some extra optimization options. In some cases, for example, the optimization options used while installing the MPI library are kept inside the mpicxx wrapper and silently applied by default when compiling your codes...

So here is my guess: your mpicxx sets, among others, the -O3 optimization option, and the compiler therefore optimizes the loop away for the MPI build, while the default -O2 that you get for your OpenMP code doesn't. As a result, you're timing pretty much nothing in the case of your MPI code, while you're timing the actual loop execution with your OpenMP one.
Just a guess, but that seems fair enough. A good test would be to check what a
mpicxx -cxx=icpc -show
would give you...