Why is Armadillo matrix computation much slower than Fortran?

Question

I am trying to port code from Fortran to C++, implementing the matrices with the Armadillo library. Both versions produce the same result, but the C++ code is much slower than the Fortran code (more than 10x). The code involves inversion, multiplication, and addition of small matrices (2x2, 4x4). Here is a representative part of the code for testing.

============================

clang++ cplusplus.cc -o cplusplus --std=c++14 -larmadillo -O2

ifort fort.f90 -o fort -O2

C++ code time: 0.39404s

Fortran code time: 0.068s

============================

C++ code:

#include <armadillo>
#include <iostream>

int main()
{
  const int niter = 1580000;
  const int ns = 3;
  arma::cx_cube m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns);
  arma::wall_clock timer;
  timer.tic();
  for (auto i=0; i<niter; ++i) {
    for (auto j=0; j<ns; ++j)
      m1.slice(j) += m2.slice(j) * m3.slice(j);
  }
  double n = timer.toc();
  std::cout << "time: " << n << "s" << std::endl;
  return 0;
}

Fortran code:

program main
  implicit none
  integer, parameter :: ns = 3, niter = 1580000
  complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
  integer i, j
  real :: start, finish
  call cpu_time(start)
  do i = 1, niter
     do j = 1, ns
        m1(1, 1, j) = m1(1, 1, j) + m2(1, 1, j) * m3(1, 1, j) + m2(1, 2, j) * m3(2, 1, j)
        m1(1, 2, j) = m1(1, 2, j) + m2(1, 1, j) * m3(1, 2, j) + m2(1, 2, j) * m3(2, 2, j)
        m1(2, 1, j) = m1(2, 1, j) + m2(2, 1, j) * m3(1, 1, j) + m2(2, 2, j) * m3(2, 1, j)
        m1(2, 2, j) = m1(2, 2, j) + m2(2, 1, j) * m3(1, 2, j) + m2(2, 2, j) * m3(2, 2, j)
     end do
  end do
  call cpu_time(finish)
  print *, "time: ", finish-start, " s"

end program main

====================================================================

Update, following the advice of @ewcz and @user5713492:

============================

clang++ cplusplus.cc -o cplusplus --std=c++14 -larmadillo -O2

ifort fort.f90 -o fort -O2

ifort fort2.f90 -o fort2 -O2

C++ code(cplusplus.cc) time: 0.39650s

Fortran code(fort.f90) (explicit operations) time: 0.020s

Fortran code(fort2.f90) (matmul) time: 0.064s

============================

C++ code(cplusplus.cc):

#include <armadillo>
#include <iostream>
#include <complex>

int main()
{
  const int niter = 1580000;
  const int ns = 3;
  arma::cx_cube m1(2, 2, ns, arma::fill::ones),
    m2(2, 2, ns, arma::fill::ones),
    m3(2, 2, ns,arma::fill::ones);
  std::complex<double> result;
  arma::wall_clock timer;
  timer.tic();
  for (auto i=0; i<niter; ++i) {
    for (auto j=0; j<ns; ++j)
      m1.slice(j) += m2.slice(j) * m3.slice(j);
  }

  double n = timer.toc();
  std::cout << "time: " << n << "s" << std::endl;
  result = arma::accu(m1);
  std::cout << result << std::endl;
  return 0;
}

Fortran code(fort.f90):

program main
  implicit none
  integer, parameter :: ns = 3, niter = 1580000
  complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
  integer i, j
  complex*16 result
  real :: start, finish
  m1 = 1
  m2 = 1
  m3 = 1
  call cpu_time(start)
  do i = 1, niter
     do j = 1, ns
        m1(1, 1, j) = m1(1, 1, j) + m2(1, 1, j) * m3(1, 1, j) + m2(1, 2, j) * m3(2, 1, j)
        m1(1, 2, j) = m1(1, 2, j) + m2(1, 1, j) * m3(1, 2, j) + m2(1, 2, j) * m3(2, 2, j)
        m1(2, 1, j) = m1(2, 1, j) + m2(2, 1, j) * m3(1, 1, j) + m2(2, 2, j) * m3(2, 1, j)
        m1(2, 2, j) = m1(2, 2, j) + m2(2, 1, j) * m3(1, 2, j) + m2(2, 2, j) * m3(2, 2, j)
     end do
  end do
  call cpu_time(finish)
  result = sum(m1)
  print *, "time: ", finish-start, " s"
  print *, result

end program main

Fortran code(fort2.f90):

program main
  implicit none
  integer, parameter :: ns = 3, niter = 1580000
  complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
  integer i, j
  complex*16 result
  real :: start, finish
  m1 = 1
  m2 = 1
  m3 = 1
  call cpu_time(start)
  do i = 1, niter
     do j = 1, ns
        m1(:,:,j) = m1(:,:,j)+matmul(m2(:,:,j),m3(:,:,j))
     end do
  end do
  call cpu_time(finish)
  result = sum(m1)
  print *, "time: ", finish-start, " s"
  print *, result

end program main

======================================================================

Complex arithmetic may be one of the reasons Armadillo is so slow. If I use arma::cube instead of arma::cx_cube in C++ and real*8 instead of complex*16 in Fortran, the times are:

C++ code time: 0.08s

Fortran code(fort.f90) (explicit operations) time: 0.012s

Fortran code(fort2.f90) (matmul) time: 0.028s

However, complex numbers are necessary for my computation. It is strange that switching to complex numbers increases the computation time so much for Armadillo (from 0.08s to about 0.40s) but far less for Fortran.

Answer 1

You aren't timing anything in gfortran. At -O2 it can see that you never use the value of m1, so it skips the calculation entirely. Also, your Fortran arrays are uninitialized, so you could be doing calculations with NaNs, which might slow things down considerably.

So you should initialize your arrays and use some kind of input like the command line, user input, or file contents so the compiler can't precompute the results.

Then you might consider changing the loop contents in Fortran to

m1(:,:,j) = m1(:,:,j)+matmul(m2(:,:,j),m3(:,:,j))

This makes the Fortran code consistent with the C++ version. (gfortran seemed to slow down a lot with this change, but ifort was quite happy with it.)

Then you MUST print out your arrays at the end so the compiler doesn't conclude that the loop you are timing can be skipped as gfortran did. Edit in the fixes and let us know about the new results.
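The advice above (initialize the arrays, feed the loop a value known only at run time, and make the result observable) can be sketched in C++ as follows. This is a minimal illustration using only the standard library, not Armadillo; the names `run_benchmark` and `BenchResult` are hypothetical and do not come from the original post.

```cpp
#include <cassert>
#include <chrono>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

struct BenchResult {
    cplx checksum;   // sum of every element of m1 after the loop
    double seconds;  // wall-clock time spent in the timed loop only
};

// Hypothetical helper (an assumption for illustration, not an Armadillo API).
// `seed` is meant to come from run-time input (argv, stdin, a file) so the
// optimizer cannot precompute the result of the loop.
BenchResult run_benchmark(cplx seed, int niter, std::size_t ns = 3) {
    assert(niter >= 0);
    // ns matrices of size 2x2, stored contiguously and column-major:
    // element (r, c) of slice j lives at index 4*j + r + 2*c.
    std::vector<cplx> m1(4 * ns, seed), m2(4 * ns, seed), m3(4 * ns, seed);

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < niter; ++i)
        for (std::size_t j = 0; j < ns; ++j)
            for (std::size_t c = 0; c < 2; ++c)
                for (std::size_t r = 0; r < 2; ++r)
                    for (std::size_t k = 0; k < 2; ++k)
                        m1[4 * j + r + 2 * c] +=
                            m2[4 * j + r + 2 * k] * m3[4 * j + k + 2 * c];
    const auto stop = std::chrono::steady_clock::now();

    // Reduce m1 to one value; the caller must print it so the loop stays live.
    cplx sum(0.0, 0.0);
    for (const cplx& x : m1) sum += x;
    return {sum, std::chrono::duration<double>(stop - start).count()};
}
```

A caller would typically parse the seed from the command line (for example with std::strtod on argv[1]), call run_benchmark, and print both the checksum and the time.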

Answer 2

I would say that, in this particular example, your Fortran version profits significantly from expanding the matrix multiplication explicitly into elementary operations. To demonstrate this, consider the following modification:

program main
  implicit none
  integer, parameter :: ns = 3, niter = 1580000
  complex*16 m1(2, 2, ns), m2(2, 2, ns), m3(2, 2, ns)
  integer i, j
  real :: start, finish
  call cpu_time(start)
  m1 = 0  ! added: initialize the accumulator so the printed sum is well defined
  m2 = 1
  m3 = 1
  do i = 1, niter
     do j = 1, ns
        !m1(1, 1, j) = m1(1, 1, j) + m2(1, 1, j) * m3(1, 1, j) + m2(1, 2, j) * m3(2, 1, j)
        !m1(1, 2, j) = m1(1, 2, j) + m2(1, 1, j) * m3(1, 2, j) + m2(1, 2, j) * m3(2, 2, j)
        !m1(2, 1, j) = m1(2, 1, j) + m2(2, 1, j) * m3(1, 1, j) + m2(2, 2, j) * m3(2, 1, j)
        !m1(2, 2, j) = m1(2, 2, j) + m2(2, 1, j) * m3(1, 2, j) + m2(2, 2, j) * m3(2, 2, j)
        m1(:, :, j) = m1(:, :, j) + MATMUL(m2(:, :, j), m3(:, :, j))
     end do
  end do
  WRITE(*, *) SUM(m1)
  call cpu_time(finish)
  print *, "time: ", finish-start, " s"
end program main

Here, at the end, the program prints the sum of m1 to make at least partially sure that the entire loop is not eliminated. With the explicit multiplication (and -O2), I get a running time of roughly 0.05s, while with the general MATMUL it is roughly 0.2s, i.e., similar to the Armadillo approach.

Also, even though Armadillo is heavily template-based, so many of the function calls involved in creating the subcube views via slice() might get eliminated, you still in principle have some overhead, whereas with Fortran you are directly manipulating contiguous chunks of memory.
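For comparison, the same explicit expansion can be written in C++ on contiguous storage. The sketch below is only an illustration under stated assumptions: it uses std::complex and raw column-major arrays rather than Armadillo, and the function names `matmul_acc_generic` and `matmul_acc_2x2` are hypothetical. The first function is a general kernel whose size is known only at run time; the second mirrors the four hand-written Fortran statements.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

using cplx = std::complex<double>;

// General kernel: c += a * b for n x n column-major matrices, with n known
// only at run time, so the compiler cannot fully unroll the loops.
void matmul_acc_generic(const cplx* a, const cplx* b, cplx* c, std::size_t n) {
    assert(c != a && c != b);  // the output must not alias the inputs
    for (std::size_t col = 0; col < n; ++col)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t row = 0; row < n; ++row)
                c[row + n * col] += a[row + n * k] * b[k + n * col];
}

// Hand-expanded 2x2 kernel: the same update written out element by element.
// Column-major layout: index 0 = (1,1), 1 = (2,1), 2 = (1,2), 3 = (2,2).
void matmul_acc_2x2(const cplx* a, const cplx* b, cplx* c) {
    c[0] += a[0] * b[0] + a[2] * b[1];  // (1,1)
    c[1] += a[1] * b[0] + a[3] * b[1];  // (2,1)
    c[2] += a[0] * b[2] + a[2] * b[3];  // (1,2)
    c[3] += a[1] * b[2] + a[3] * b[3];  // (2,2)
}
```

Both functions compute the same result. Whether the expanded version is actually faster in a given build depends on the compiler and flags, so it is worth timing both in the real program.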
