Why is Armadillo matrix computation much slower than Fortran?

Question

I am trying to rewrite code from Fortran to C++, implementing the matrices with the Armadillo library. Both codes give the same result, but the C++ code is much slower than the Fortran one (> 10x). The code involves inverses, multiplications, and additions of small matrices (2x2, 4x4). I have put part of a similar code here for testing.

============================

clang++ cplusplus.cc -o cplusplus --std=c++14 -larmadillo -O2

ifort fort.f90 -o fort -O2

C++ code time: 0.39404s

Fortran code time: 0.068s

============================

C++ code:

#include <armadillo>
#include <iostream>

int main()
{
  const int niter = 1580000;
  const int ns = 3;
  arma::cx_cube m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns);
  arma::wall_clock timer;
  timer.tic();
  for (auto i=0; i<niter; ++i) {
    for (auto j=0; j<ns; ++j)
      m1.slice(j) += m2.slice(j) * m3.slice(j);
  }
  double n = timer.toc();
  std::cout << "time: " << n << "s" << std::endl;
  return 0;
}

Fortran code:

program main
  implicit none
  integer, parameter :: ns = 3, niter = 1580000
  complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
  integer i, j
  real :: start, finish
  call cpu_time(start)
  do i = 1, niter
     do j = 1, ns
        m1(1, 1, j) = m1(1, 1, j) + m2(1, 1, j) * m3(1, 1, j) + m2(1, 2, j) * m3(2, 1, j)
        m1(1, 2, j) = m1(1, 2, j) + m2(1, 1, j) * m3(1, 2, j) + m2(1, 2, j) * m3(2, 2, j)
        m1(2, 1, j) = m1(2, 1, j) + m2(2, 1, j) * m3(1, 1, j) + m2(2, 2, j) * m3(2, 1, j)
        m1(2, 2, j) = m1(2, 2, j) + m2(2, 1, j) * m3(1, 2, j) + m2(2, 2, j) * m3(2, 2, j)
     end do
  end do
  call cpu_time(finish)
  print *, "time: ", finish-start, " s"

end program main

====================================================================

Update, following the advice of @ewcz and @user5713492:

============================

clang++ cplusplus.cc -o cplusplus --std=c++14 -larmadillo -O2

ifort fort.f90 -o fort -O2

ifort fort2.f90 -o fort2 -O2

C++ code (cplusplus.cc) time: 0.39650s

Fortran code (fort.f90, explicit operations) time: 0.020s

Fortran code (fort2.f90, matmul) time: 0.064s

============================

C++ code (cplusplus.cc):

#include <armadillo>
#include <iostream>
#include <complex>

int main()
{
  const int niter = 1580000;
  const int ns = 3;
  arma::cx_cube m1(2, 2, ns, arma::fill::ones),
    m2(2, 2, ns, arma::fill::ones),
    m3(2, 2, ns,arma::fill::ones);
  std::complex<double> result;
  arma::wall_clock timer;
  timer.tic();
  for (auto i=0; i<niter; ++i) {
    for (auto j=0; j<ns; ++j)
      m1.slice(j) += m2.slice(j) * m3.slice(j);
  }

  double n = timer.toc();
  std::cout << "time: " << n << "s" << std::endl;
  result = arma::accu(m1);
  std::cout << result << std::endl;
  return 0;
}

Fortran code (fort.f90):

program main
  implicit none
  integer, parameter :: ns = 3, niter = 1580000
  complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
  integer i, j
  complex*16 result
  real :: start, finish
  m1 = 1
  m2 = 1
  m3 = 1
  call cpu_time(start)
  do i = 1, niter
     do j = 1, ns
        m1(1, 1, j) = m1(1, 1, j) + m2(1, 1, j) * m3(1, 1, j) + m2(1, 2, j) * m3(2, 1, j)
        m1(1, 2, j) = m1(1, 2, j) + m2(1, 1, j) * m3(1, 2, j) + m2(1, 2, j) * m3(2, 2, j)
        m1(2, 1, j) = m1(2, 1, j) + m2(2, 1, j) * m3(1, 1, j) + m2(2, 2, j) * m3(2, 1, j)
        m1(2, 2, j) = m1(2, 2, j) + m2(2, 1, j) * m3(1, 2, j) + m2(2, 2, j) * m3(2, 2, j)
     end do
  end do
  call cpu_time(finish)
  result = sum(m1)
  print *, "time: ", finish-start, " s"
  print *, result

end program main

Fortran code (fort2.f90):

program main
  implicit none
  integer, parameter :: ns = 3, niter = 1580000
  complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
  integer i, j
  complex*16 result
  real :: start, finish
  m1 = 1
  m2 = 1
  m3 = 1
  call cpu_time(start)
  do i = 1, niter
     do j = 1, ns
        m1(:,:,j) = m1(:,:,j)+matmul(m2(:,:,j),m3(:,:,j))
     end do
  end do
  call cpu_time(finish)
  result = sum(m1)
  print *, "time: ", finish-start, " s"
  print *, result

end program main

======================================================================

Complex numbers may be one of the reasons Armadillo is so slow. If I use arma::cube instead of arma::cx_cube in C++ and real*8 in Fortran, the times are:

C++ code time: 0.08s

Fortran code (fort.f90, explicit operations) time: 0.012s

Fortran code (fort2.f90, matmul) time: 0.028s

However, complex numbers are necessary for my computation. It is strange that switching to complex numbers increases the computation time so much for the Armadillo library but only slightly for Fortran.

Answer 1

You aren't timing anything with gfortran. At -O2 it can see that you never use the value of m1, so it skips the calculation entirely. Also, your Fortran arrays are uninitialized, so you could be doing calculations with NaNs, which might slow things down considerably.

So you should initialize your arrays and take some kind of input, such as the command line, user input, or file contents, so the compiler can't precompute the results.

Then you might consider changing the loop contents in Fortran to

m1(:,:,j) = m1(:,:,j)+matmul(m2(:,:,j),m3(:,:,j))

so as to be consistent with the C++ code. (gfortran seemed to slow down a lot when doing this, but ifort was quite happy with it.)

Then you MUST print out your arrays at the end so the compiler doesn't conclude that the loop you are timing can be skipped, as gfortran did. Edit in the fixes and let us know about the new results.
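As a rough illustration of the advice above, here is a minimal C++ sketch that uses only the standard library, not Armadillo. The names run_benchmark and timed_run are hypothetical helpers invented for this example, and the flat column-major storage is an assumption made to keep it self-contained. The arrays are initialized, the iteration count is a runtime parameter, and a checksum of m1 is printed so the timed loop cannot be discarded as dead code.

```cpp
#include <cassert>
#include <chrono>
#include <complex>
#include <cstdio>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical kernel: m1 += m2 * m3 for `ns` independent 2x2 matrices.
// Each matrix occupies 4 consecutive entries in column-major order.
// Returns the sum of all entries of m1 as a checksum.
cplx run_benchmark(int niter, int ns, cplx fill) {
  assert(niter >= 0 && ns > 0);
  // All three arrays are explicitly initialized before timing starts.
  std::vector<cplx> m1(4 * ns, fill), m2(4 * ns, fill), m3(4 * ns, fill);
  for (int it = 0; it < niter; ++it)
    for (int s = 0; s < ns; ++s)
      for (int col = 0; col < 2; ++col)
        for (int row = 0; row < 2; ++row)
          for (int k = 0; k < 2; ++k)
            m1[4 * s + row + 2 * col] +=
                m2[4 * s + row + 2 * k] * m3[4 * s + k + 2 * col];
  cplx sum(0.0, 0.0);
  for (const cplx& x : m1) sum += x;
  return sum;
}

// Times one run and prints both the elapsed time and the checksum.
// Printing the checksum makes the loop's result observable.
cplx timed_run(int niter, int ns, cplx fill) {
  const auto t0 = std::chrono::steady_clock::now();
  const cplx sum = run_benchmark(niter, ns, fill);
  const auto t1 = std::chrono::steady_clock::now();
  const double secs = std::chrono::duration<double>(t1 - t0).count();
  std::printf("time: %g s\nchecksum: (%g, %g)\n", secs, sum.real(), sum.imag());
  return sum;
}
```

A caller would typically take niter from the command line, for example timed_run(std::atoi(argv[1]), 3, cplx(1.0, 0.0)) inside main, so that the iteration count is not a compile-time constant.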

Answer 2

I would say that in this particular example your Fortran version profits significantly from expanding the matrix multiplication explicitly into elementary operations. To demonstrate this, consider the following modification:

program main
  implicit none
  integer, parameter :: ns = 3, niter = 1580000
  complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
  integer i, j
  real :: start, finish
  call cpu_time(start)
  m1 = 1   ! initialize m1 too, so the accumulation starts from a defined value
  m2 = 1
  m3 = 1
  do i = 1, niter
     do j = 1, ns
        ! explicit version, commented out in favor of MATMUL below
        !m1(1, 1, j) = m1(1, 1, j) + m2(1, 1, j) * m3(1, 1, j) + m2(1, 2, j) * m3(2, 1, j)
        !m1(1, 2, j) = m1(1, 2, j) + m2(1, 1, j) * m3(1, 2, j) + m2(1, 2, j) * m3(2, 2, j)
        !m1(2, 1, j) = m1(2, 1, j) + m2(2, 1, j) * m3(1, 1, j) + m2(2, 2, j) * m3(2, 1, j)
        !m1(2, 2, j) = m1(2, 2, j) + m2(2, 1, j) * m3(1, 2, j) + m2(2, 2, j) * m3(2, 2, j)
        m1(:, :, j) = m1(:, :, j) + MATMUL(m2(:, :, j), m3(:, :, j))
     end do
  end do
  WRITE(*, *) SUM(m1)
  call cpu_time(finish)
  print *, "time: ", finish-start, " s"
end program main

Here, the program prints the sum of m1 at the end to make at least partially sure that the entire loop is not eliminated. With the explicit multiplication (and -O2), I get a running time of roughly 0.05s, while with the general MATMUL it is roughly 0.2s, i.e., similar to the Armadillo approach...
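To make the contrast concrete on the C++ side, the following is a minimal sketch in plain C++ (no Armadillo) of the same two styles: a general loop-based multiply-accumulate and a hand-expanded 2x2 version that mirrors the four explicit Fortran assignments. The function names madd_generic and madd_2x2 are hypothetical, and the column-major layout is an assumption chosen to match Fortran. Whether the expanded form is actually faster depends on the compiler and flags, so it would need to be measured.

```cpp
#include <cassert>
#include <complex>

using cplx = std::complex<double>;

// Generic version: a += b * c for n x n column-major matrices.
void madd_generic(cplx* a, const cplx* b, const cplx* c, int n) {
  assert(n > 0);
  for (int col = 0; col < n; ++col)
    for (int row = 0; row < n; ++row)
      for (int k = 0; k < n; ++k)
        a[row + n * col] += b[row + n * k] * c[k + n * col];
}

// Hand-expanded 2x2 version, mirroring the four Fortran assignments.
// Column-major layout: element (row, col) lives at index row + 2 * col.
// The trailing comments give the one-based Fortran indices.
void madd_2x2(cplx* a, const cplx* b, const cplx* c) {
  a[0] += b[0] * c[0] + b[2] * c[1];  // (1,1)
  a[1] += b[1] * c[0] + b[3] * c[1];  // (2,1)
  a[2] += b[0] * c[2] + b[2] * c[3];  // (1,2)
  a[3] += b[1] * c[2] + b[3] * c[3];  // (2,2)
}
```

Both functions compute the same update; the second simply removes the loop structure and index arithmetic, which gives the optimizer a fixed, fully visible sequence of operations.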

Also, even though Armadillo is heavily template-based, so many of the function calls involved in creating the subcube views via slice() might get eliminated, in principle you still have some overhead, whereas in Fortran you are directly manipulating contiguous chunks of memory.
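The "contiguous chunks of memory" point can be sketched in C++ as well. The struct below is a hypothetical stand-in for a Fortran array declared as m(2, 2, ns); it is not how Armadillo implements its cubes, only an illustration of a layout in which a slice is nothing more than a pointer offset into one flat buffer.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical flat storage for ns matrices of size 2x2,
// column-major with zero-based indices.
struct Cube2x2 {
  std::size_t ns;
  std::vector<cplx> data;

  explicit Cube2x2(std::size_t n, cplx fill = cplx(0.0, 0.0))
      : ns(n), data(4 * n, fill) {}

  // Element (i, j, k) sits at offset i + 2*j + 4*k in the buffer.
  cplx& at(std::size_t i, std::size_t j, std::size_t k) {
    assert(i < 2 && j < 2 && k < ns);
    return data[i + 2 * j + 4 * k];
  }

  // A "slice" is just a pointer to 4 consecutive values;
  // no view object is constructed.
  cplx* slice(std::size_t k) {
    assert(k < ns);
    return data.data() + 4 * k;
  }
};
```

With this layout, consecutive slices are exactly 4 elements apart, so a kernel that works on raw pointers can walk through the whole cube without creating any intermediate objects.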


Answer 1
3 2017-12-21 08:20:01

Answer 2
2 2017-12-21 08:20:41
