Eigen is much slower than Fortran in matrix multiplication using an explicit loop

Question

I tried to port code from Fortran to C++, implementing a 2000*2000 matrix multiplication with the Eigen library. I found that an explicit for loop over Eigen matrices is much slower (>3x) than the equivalent do loop in Fortran. The code is listed below:

test.f90

program main
implicit none
integer :: n,i,j,k
integer :: tic,toc
real(8),ALLOCATABLE ::a(:,:),b(:,:),c(:,:)
real(8) :: s

n = 2000
allocate(a(n,n),b(n,n),c(n,n))
do i=1,n
    do j =1,n
        a(j,i) = i * 1.0
        b(j,i) = i * 1.0
    enddo
enddo

call system_clock(tic)
do j=1,n
    do i=1,n
        s = 0.0
        do k=1,n
            s = s + a(i,k) * b(k,j)
        enddo
        c(i,j) = s
    enddo
enddo
call system_clock(toc)
print*,'Fortran with loop:', (toc - tic) / 1000.0

call system_clock(tic)
c = matmul(a,b)
call system_clock(toc)
print*,'Fortran with matmul:', (toc - tic) / 1000.0


DEALLOCATE(a,b,c)
end

test.cpp

#include <Eigen/Core>
#include <time.h>
#include <iostream>
using Eigen::MatrixXd;

int main(){
    int n = 2000;
    MatrixXd a(n,n),b(n,n),c(n,n);
    for(int i=0;i<n;i++){
        for(int j=0;j<n;j++){
            a(i,j) = i * 1.0;
            b(i,j) = j * 1.0;
        }
    }
    clock_t tic,toc;

    // Explicit triple loop
    tic = clock();
    for(int j=0;j<n;j++){
        for(int i=0;i<n;i++){
            double s = 0.0;
            for(int k=0;k<n;k++){
                s += a(i,k) * b(k,j);
            }
            c(i,j) = s;
        }
    }
    toc = clock();
    std::cout << "Eigen with loop: " << (double)((toc - tic)) / CLOCKS_PER_SEC << std::endl;

    // Eigen's built-in matrix product
    tic = clock();
    c = a * b;
    toc = clock();
    std::cout << "Eigen with *: " << (double)((toc - tic)) / CLOCKS_PER_SEC << std::endl;
}

Compiled with (gcc-8.4 on Ubuntu 18.04):

gfortran test.f90 -O3 -march=native -o testf
g++ test.cpp -O3 -march=native -I/path/to/eigen -o testcpp

And I get these results:

Fortran with loop:   10.9700003
Fortran with matmul:   0.834999979
Eigen with loop: 38.2188
Eigen with *: 0.40625

The built-in implementations are of comparable speed, so why is Eigen so much slower for the explicit loop implementation?

Answer 1

The biggest problem with the loops is that they do not run in the proper order for the storage layout. The innermost loop should walk contiguous memory: the last index for row-major storage (plain C++ arrays) and the first index for column-major storage (Fortran, and also Eigen's MatrixXd, which is column-major by default). Getting this wrong gives you a large performance hit, especially for large matrices.
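To make the layout point concrete, here is a minimal sketch (not part of the original answer) of how a column-major n-by-n matrix maps (i, j) to a flat offset. The helper names `col_major_offset`, `stride_along_row`, and `stride_down_column` are hypothetical and exist only for illustration; they assume 0-based indices and n >= 2.

```cpp
#include <cassert>
#include <cstddef>

// Flat offset of element (i, j), 0-based, in an n-by-n column-major matrix.
// Hypothetical helper for illustration only.
inline std::size_t col_major_offset(std::size_t i, std::size_t j, std::size_t n) {
    assert(i < n && j < n);
    return i + j * n;
}

// Elements skipped when k advances by one in a(i, k): a step along a row.
// Assumes n >= 2.
inline std::size_t stride_along_row(std::size_t n) {
    return col_major_offset(0, 1, n) - col_major_offset(0, 0, n);  // equals n
}

// Elements skipped when i advances by one in a(i, k): a step down a column.
// Assumes n >= 2.
inline std::size_t stride_down_column(std::size_t n) {
    return col_major_offset(1, 0, n) - col_major_offset(0, 0, n);  // equals 1
}
```

With n = 2000 and 8-byte doubles, each step of an inner loop over k in a(i, k) jumps 16,000 bytes, whereas a step over i moves to the neighbouring element.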

The nativemul implementation by John Alexiou (with dot_product) has the same problem, so I am very surprised that he claims it's faster. (And I find that it isn't; see below. Maybe his (Intel?) compiler rewrites the code to use matmul internally.)

This is the correct loop order for Fortran:

    c = 0
    do j=1,n
        do k=1,n
            do i=1,n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
        enddo
    enddo

With gfortran version 10.2.0, compiled with -O3, I get:

 Fortran with original OP's loop:   53.5190010    
 Fortran with John Alexiou's nativemul:   53.4309998    
 Fortran with correct loop:   11.0679998    
 Fortran with matmul:   2.36999989

A correct loop in C++ should give you similar performance.
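As a rough sketch of what such a loop could look like in C++, the code below applies the same j, k, i ordering to a column-major matrix. It is an illustration under assumptions, not the answer author's code: it uses a flat `std::vector<double>` as a stand-in for Eigen's default column-major storage so that it has no dependencies, and the names `Matrix`, `multiply_jik`, and `multiply_jki` are hypothetical. No timings are claimed for it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major n-by-n matrix in a flat vector: element (i, j) is at i + j * n.
using Matrix = std::vector<double>;

// Original order (j, i, k): the inner loop reads a(i, k) with stride n.
void multiply_jik(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            double s = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                s += a[i + k * n] * b[k + j * n];
            }
            c[i + j * n] = s;
        }
    }
}

// Reordered (j, k, i): the inner loop walks column k of a and column j of c
// contiguously, and b(k, j) is constant inside it.
void multiply_jki(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    std::fill(c.begin(), c.end(), 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t k = 0; k < n; ++k) {
            const double bkj = b[k + j * n];
            for (std::size_t i = 0; i < n; ++i) {
                c[i + j * n] += a[i + k * n] * bkj;
            }
        }
    }
}
```

Both functions compute the same product; only the memory access pattern differs. With Eigen itself, the equivalent change would be to reorder the three for loops around `c(i,j) += a(i,k) * b(k,j)` after zeroing `c`.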

Obviously matmul/BLAS are much faster for large matrices.

Answer 2

In the Fortran code I saw the same problem, but then I moved the matrix multiplication into a subroutine and the resulting speed was almost as good as matmul. I also compared against the BLAS Level 3 function (gemm).

Fortran with loop:   9.220000
Fortran with matmul:   8.450000
Fortran with blas3:   2.050000

and the code that produces it:

program ConsoleMatMul
use BLAS95
implicit none
integer :: n,i,j
integer :: tic,toc
real(8),ALLOCATABLE :: a(:,:),b(:,:),c(:,:),xe(:,:)

n = 2000
allocate(a(n,n),b(n,n),c(n,n),xe(n,n))
do i=1,n
    do j =1,n
        a(j,i) = i * 1.0
        b(j,i) = i * 1.0
    enddo
enddo

call system_clock(tic)
call nativemul(a,b,c)
call system_clock(toc)
print*,'Fortran with loop:', (toc - tic) / 1000.0

call system_clock(tic)
c = matmul(a,b)
call system_clock(toc)
print*,'Fortran with matmul:', (toc - tic) / 1000.0
c = b
xe = 0d0
call system_clock(tic)
call gemm(a,c,xe) ! BLAS MATRIX/MATRIX MUL
call system_clock(toc)
print*,'Fortran with blas3:', (toc - tic) / 1000.0

DEALLOCATE(a,b,c)

contains

pure subroutine nativemul(a,b,c)
real(8), intent(in), allocatable :: a(:,:), b(:,:)
real(8), intent(out), allocatable :: c(:,:)
real(8) :: s
integer :: n, i,j,k
    n = size(a,1)
    if (.not. allocated(c)) allocate(c(n,n))
    do j=1,n
        do i=1,n
            s = 0.0d0
            do k=1,n
                s = s + a(i,k) * b(k,j)
            end do
            c(i,j) = s
        end do
    end do
end subroutine    

end program ConsoleMatMul

Before I moved the code into a subroutine I got:

Fortran with loop:   85.450000

Update: the native multiplication reaches matmul levels (or exceeds them) when the inner loop is replaced by dot_product().

pure subroutine nativemul(a,b,c)
real(8), intent(in) :: a(:,:), b(:,:)
real(8), intent(out) :: c(:,:)
integer :: n, i,j
    n = size(a,1)
    do j=1,n
        do i=1,n
            c(i,j) = dot_product(a(i,:),b(:,j))
            ! or  = sum(a(i,:)*b(:,j))
        end do
    end do
end subroutine

Answer 3

C++ pre-increment can be faster than post-increment, although for plain int loop counters an optimizing compiler typically generates identical code for both:

for(int j=0;j<n;++j){
    for(int i=0;i<n;++i){
        double s = 0.0;
        for(int k=0;k<n;++k){
            s += a(i,k) * b(k,j);
        }
        c(i,j) = s;
    }
}

Answer 1
3 2021-02-18 09:53:32

Answer 2
0 2020-12-19 07:51:16

Answer 3
-1 2021-01-05 14:43:58
