
Eigen is much slower than Fortran in matrix multiplication using an explicit loop

I tried to rewrite code from Fortran to C++, implementing a 2000*2000 matrix multiplication through the Eigen library. I found that the for loop in Eigen is much slower (>3x) than the do loop in Fortran. The codes are listed below:

test.f90

program main
implicit none
integer :: n,i,j,k
integer :: tic,toc
real(8),ALLOCATABLE ::a(:,:),b(:,:),c(:,:)
real(8) :: s

n = 2000
allocate(a(n,n),b(n,n),c(n,n))
do i=1,n
    do j =1,n
        a(j,i) = i * 1.0
        b(j,i) = i * 1.0
    enddo
enddo

call system_clock(tic)
do j=1,n
    do i=1,n
        s = 0.0
        do k=1,n
            s = s + a(i,k) * b(k,j)
        enddo
        c(i,j) = s
    enddo
enddo
call system_clock(toc)
print*,'Fortran with loop:', (toc - tic) / 1000.0

call system_clock(tic)
c = matmul(a,b)
call system_clock(toc)
print*,'Fortran with matmul:', (toc - tic) / 1000.0


DEALLOCATE(a,b,c)
end

test.cpp

#include<Eigen/Core>
#include<time.h>
#include<iostream>
using Eigen::MatrixXd;

int main(){
    int n = 2000;
    MatrixXd a(n,n),b(n,n),c(n,n);
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            a(i,j) = i * 1.0;
            b(i,j) = j * 1.0;
        }
    }
    clock_t tic,toc;
    tic = clock();
    for(int j=0;j<n;j++){
        for(int i=0;i<n;i++){
            double s= 0.0;
            for(int k=0;k<n;k++){
                s += a(i,k) * b(k,j);
            }
            c(i,j) = s;
        }
    }
    toc = clock();
    std::cout << (double)((toc - tic)) / CLOCKS_PER_SEC << std::endl;

    tic = clock();
    c=  a * b;
    toc = clock();
    std::cout << (double)((toc - tic)) / CLOCKS_PER_SEC << std::endl;
}

Compiled with gcc-8.4, on Ubuntu-18.04:

gfortran test.f90 -O3 -march=native -o testf
g++ test.cpp -O3 -march=native -I/path/to/eigen -o testcpp 

And I get these results:

Fortran with loop:   10.9700003
Fortran with matmul:   0.834999979
Eigen with loop: 38.2188
Eigen with *: 0.40625

The internal implementations are of comparable speed, but why is Eigen so much slower for the loop implementation?

The biggest problem with the loops is that they are not done in the proper order for either C++ (which should be row-major) or Fortran (which should be column-major). This gives you a large performance hit, especially for large matrices.

The nativemul implementation by John Alexiou (with dot_product) has the same problem, so I am very surprised that he claims it's faster. (And I find that it isn't; see below. Maybe his (Intel?) compiler rewrites the code to use matmul internally.)

This is the correct loop order for Fortran:

    c = 0
    do j=1,n
        do k=1,n
            do i=1,n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
        enddo
    enddo

With gfortran version 10.2.0, and compiled with -O3, I get

 Fortran with original OP's loop:   53.5190010    
 Fortran with John Alexiou's nativemul:   53.4309998    
 Fortran with correct loop:   11.0679998    
 Fortran with matmul:   2.36999989    

A correct loop in C++ should give you similar performance.
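For reference, a minimal C++ sketch of that correct order (my addition, not from the original answer), using the a, b, c and n from the OP's test.cpp: Eigen's MatrixXd is column-major by default, so the same j-k-i ordering as in the Fortran loop keeps the innermost index walking contiguously down a column.

c.setZero();
for (int j = 0; j < n; ++j) {              // columns of b and c
    for (int k = 0; k < n; ++k) {          // columns of a / rows of b
        const double bkj = b(k, j);
        for (int i = 0; i < n; ++i) {      // contiguous down column k of a and column j of c
            c(i, j) += a(i, k) * bkj;
        }
    }
}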

Obviously matmul/BLAS are much faster for large matrices.
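One Eigen-side note on the benchmark (my addition, not part of the original answer): when timing Eigen's built-in product, assigning through noalias() tells Eigen that the destination does not alias the operands, so it skips the temporary that a plain c = a * b creates.

c.noalias() = a * b;   // skips the temporary Eigen would otherwise allocate for the product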

In the Fortran code I saw the same problem, but then I moved the matrix multiplication into a subroutine and the resulting speed was almost as good as matmul. I also compared it to the BLAS Level 3 function.

Fortran with loop:   9.220000
Fortran with matmul:   8.450000
Fortran with blas3:   2.050000

and the code to produce it:

program ConsoleMatMul
use BLAS95
implicit none
integer :: n,i,j
integer :: tic,toc
real(8),ALLOCATABLE :: a(:,:),b(:,:),c(:,:),xe(:,:)

n = 2000
allocate(a(n,n),b(n,n),c(n,n),xe(n,n))
do i=1,n
    do j =1,n
        a(j,i) = i * 1.0
        b(j,i) = i * 1.0
    enddo
enddo

call system_clock(tic)
call nativemul(a,b,c)
call system_clock(toc)
print*,'Fortran with loop:', (toc - tic) / 1000.0

call system_clock(tic)
c = matmul(a,b)
call system_clock(toc)
print*,'Fortran with matmul:', (toc - tic) / 1000.0
c = b
xe = 0d0
call system_clock(tic)
call gemm(a,c,xe) ! BLAS MATRIX/MATRIX MUL
call system_clock(toc)
print*,'Fortran with blas3:', (toc - tic) / 1000.0

DEALLOCATE(a,b,c)

contains

pure subroutine nativemul(a,b,c)
real(8), intent(in), allocatable :: a(:,:), b(:,:)
real(8), intent(out), allocatable :: c(:,:)
real(8) :: s
integer :: n, i,j,k
    n = size(a,1)
    if (.not. allocated(c)) allocate(c(n,n))
    do j=1,n
        do i=1,n
            s = 0.0d0
            do k=1,n
                s = s + a(i,k) * b(k,j)
            end do
            c(i,j) = s
        end do
    end do
end subroutine    

end program ConsoleMatMul

Before I moved the code into a subroutine I got

Fortran with loop:   85.450000

Update: the native multiplication reaches matmul levels (or exceeds them) when the inner loop is replaced by a dot_product().

pure subroutine nativemul(a,b,c)
real(8), intent(in) :: a(:,:), b(:,:)
real(8), intent(out) :: c(:,:)
integer :: n, i,j
    n = size(a,1)
    do j=1,n
        do i=1,n
            c(i,j) = dot_product(a(i,:),b(:,j))
            ! or  = sum(a(i,:)*b(:,j))
        end do
    end do
end subroutine    
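For comparison on the C++ side, a rough Eigen equivalent of this dot_product formulation (my own sketch, not from the answer), again using the OP's a, b, c and n:

for (int j = 0; j < n; ++j) {
    for (int i = 0; i < n; ++i) {
        // a.row(i) is a strided view, b.col(j) is contiguous,
        // mirroring a(i,:) and b(:,j) in the Fortran code above
        c(i, j) = a.row(i).dot(b.col(j));
    }
}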

C++ pre-increment is faster than post-increment...

for(int j=0;j<n;++j){
    for(int i=0;i<n;++i){
        double s = 0.0;
        for(int k=0;k<n;++k){
            s += a(i,k) * b(k,j);
        }
        c(i,j) = s;
    }
}
