Eigen is much slower than Fortran in matrix multiplication using an explicit loop
I tried to rewrite code from Fortran to C++, with a 2000*2000 matrix multiplication implemented through the Eigen library. I found that a for loop over Eigen matrices is much slower (>3x) than the equivalent do loop in Fortran. The code is listed below:
test.f90
program main
  implicit none
  integer :: n,i,j,k
  integer :: tic,toc
  real(8),ALLOCATABLE :: a(:,:),b(:,:),c(:,:)
  real(8) :: s
  n = 2000
  allocate(a(n,n),b(n,n),c(n,n))
  do i=1,n
    do j=1,n
      a(j,i) = i * 1.0
      b(j,i) = i * 1.0
    enddo
  enddo
  call system_clock(tic)
  do j=1,n
    do i=1,n
      s = 0.0
      do k=1,n
        s = s + a(i,k) * b(k,j)
      enddo
      c(i,j) = s
    enddo
  enddo
  call system_clock(toc)
  print*,'Fortran with loop:', (toc - tic) / 1000.0
  call system_clock(tic)
  c = matmul(a,b)
  call system_clock(toc)
  print*,'Fortran with matmul:', (toc - tic) / 1000.0
  DEALLOCATE(a,b,c)
end
test.cpp
#include <Eigen/Core>
#include <time.h>
#include <iostream>
using Eigen::MatrixXd;
int main(){
    int n = 2000;
    MatrixXd a(n,n),b(n,n),c(n,n);
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            a(i,j) = i * 1.0;
            b(i,j) = j * 1.0;
        }
    }
    clock_t tic,toc;
    // Explicit triple loop
    tic = clock();
    for(int j=0;j<n;j++){
        for(int i=0;i<n;i++){
            double s = 0.0;
            for(int k=0;k<n;k++){
                s += a(i,k) * b(k,j);
            }
            c(i,j) = s;
        }
    }
    toc = clock();
    std::cout << "Eigen with loop: " << (double)((toc - tic)) / CLOCKS_PER_SEC << std::endl;
    // Eigen's built-in matrix product
    tic = clock();
    c = a * b;
    toc = clock();
    std::cout << "Eigen with *: " << (double)((toc - tic)) / CLOCKS_PER_SEC << std::endl;
}
Compiled with (gcc-8.4 on Ubuntu 18.04):
gfortran test.f90 -O3 -march=native -o testf
g++ test.cpp -O3 -march=native -I/path/to/eigen -o testcpp
And I get these results:
Fortran with loop: 10.9700003
Fortran with matmul: 0.834999979
Eigen with loop: 38.2188
Eigen with *: 0.40625
The built-in implementations (matmul and Eigen's *) are of comparable speed, so why is Eigen so much slower for the loop implementation?
The biggest problem with the loops is that they are not done in the proper order for the storage layout. The innermost loop should vary the index that is contiguous in memory: the last index for row-major storage (plain C/C++ arrays), the first index for column-major storage (Fortran, and also Eigen's MatrixXd, which is column-major by default). Getting this wrong gives you a large performance hit, especially for large matrices.
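To make the stride argument concrete, here is a minimal sketch assuming a column-major layout (Fortran's layout, and the default for Eigen's MatrixXd). The helper names col_major_index, stride_over_k_in_a and stride_over_i_in_a are hypothetical and exist only for this illustration; they are not part of Eigen or of the code in the question.

```cpp
#include <cstddef>

// Offset of element (i, j) in an n-by-n matrix stored column by column.
// Hypothetical helper for illustration; it mirrors a column-major layout.
constexpr std::size_t col_major_index(std::size_t i, std::size_t j, std::size_t n) {
    return i + j * n;
}

// Distance in elements between consecutive reads of a(i,k) when the
// innermost loop runs over k (the order used in the question).
constexpr std::size_t stride_over_k_in_a(std::size_t n) {
    return col_major_index(0, 1, n) - col_major_index(0, 0, n);
}

// Distance in elements between consecutive reads of a(i,k) when the
// innermost loop runs over i instead.
constexpr std::size_t stride_over_i_in_a(std::size_t n) {
    return col_major_index(1, 0, n) - col_major_index(0, 0, n);
}
```

For n = 2000, the k-innermost loop jumps 2000 doubles (16000 bytes) between reads of a, while an i-innermost loop moves one double at a time, which is what caches and vector units favor.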
The nativemul implementation by John Alexiou (with dot_product) has the same problem, so I am very surprised that he claims it's faster. (And I find that it isn't; see below. Maybe his (Intel?) compiler rewrites the code to use matmul internally.)
This is the correct loop order for Fortran:
c = 0
do j=1,n
  do k=1,n
    do i=1,n
      c(i,j) = c(i,j) + a(i,k) * b(k,j)
    enddo
  enddo
enddo
With gfortran version 10.2.0, compiled with -O3, I get:
Fortran with original OP's loop: 53.5190010
Fortran with John Alexiou's nativemul: 53.4309998
Fortran with correct loop: 11.0679998
Fortran with matmul: 2.36999989
A correct loop in C++ should give you similar performance.
Obviously matmul/BLAS are much faster for large matrices.
In the Fortran code I saw the same problem, but then I moved the matrix multiplication into a subroutine and the resulting speed was almost as good as matmul. I also compared against the BLAS Level 3 function.
Fortran with loop: 9.220000
Fortran with matmul: 8.450000
Fortran with blas3: 2.050000
and the code to produce it:
program ConsoleMatMul
  use BLAS95
  implicit none
  integer :: n,i,j
  integer :: tic,toc
  real(8),ALLOCATABLE :: a(:,:),b(:,:),c(:,:),xe(:,:)
  n = 2000
  allocate(a(n,n),b(n,n),c(n,n),xe(n,n))
  do i=1,n
    do j=1,n
      a(j,i) = i * 1.0
      b(j,i) = i * 1.0
    enddo
  enddo
  call system_clock(tic)
  call nativemul(a,b,c)
  call system_clock(toc)
  print*,'Fortran with loop:', (toc - tic) / 1000.0
  call system_clock(tic)
  c = matmul(a,b)
  call system_clock(toc)
  print*,'Fortran with matmul:', (toc - tic) / 1000.0
  c = b
  xe = 0d0
  call system_clock(tic)
  call gemm(a,c,xe) ! BLAS MATRIX/MATRIX MUL
  call system_clock(toc)
  print*,'Fortran with blas3:', (toc - tic) / 1000.0
  DEALLOCATE(a,b,c)
contains
  pure subroutine nativemul(a,b,c)
    real(8), intent(in), allocatable :: a(:,:), b(:,:)
    real(8), intent(out), allocatable :: c(:,:)
    real(8) :: s
    integer :: n, i,j,k
    n = size(a,1)
    if (.not. allocated(c)) allocate(c(n,n))
    do j=1,n
      do i=1,n
        s = 0.0d0
        do k=1,n
          s = s + a(i,k) * b(k,j)
        end do
        c(i,j) = s
      end do
    end do
  end subroutine
end program ConsoleMatMul
Before I moved the code into a subroutine I got:
Fortran with loop: 85.450000
Update: the native multiplication reaches matmul levels (or exceeds them) when the inner loop is replaced by a dot_product().
pure subroutine nativemul(a,b,c)
  real(8), intent(in) :: a(:,:), b(:,:)
  real(8), intent(out) :: c(:,:)
  integer :: n, i,j
  n = size(a,1)
  do j=1,n
    do i=1,n
      c(i,j) = dot_product(a(i,:),b(:,j))
      ! or = sum(a(i,:)*b(:,j))
    end do
  end do
end subroutine
C++ pre-increment can be faster than post-increment, although for plain int counters like these an optimizing compiler typically generates identical code for both:
for(int j=0;j<n;++j){
    for(int i=0;i<n;++i){
        double s = 0.0;
        for(int k=0;k<n;++k){
            s += a(i,k) * b(k,j);
        }
        c(i,j) = s;
    }
}