
Performance abnormality in homebrew array reduction

The code below implements two array classes, one-dimensional and two-dimensional (column-major order), and a clock for measuring wall-clock time.

The function of concern is a reduction of the 2d array into a 1d array via a lambda callback, either along the rows or along the columns. In both cases the 2d array is traversed in the same order. However, dropping the row dimension needs almost twice as much time as dropping the column dimension, which is unclear to me because the major performance driver should be traversing the 2d array.

#include <iostream>
#include <string>
#include <chrono>
#include <cstdlib> // for malloc/free
#define i64 long long int
using namespace std;
class hdclock{
private:
  std::chrono::time_point<std::chrono::high_resolution_clock> start;
public:
  void tic(){
    this->start=std::chrono::high_resolution_clock::now();
  };
  double toc(){
    auto end=std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - this->start);
    return((double)duration.count()/1000.0);
  }
};
template<class T> class arr1d;
template<class T> class arr2d;
template<class T>
class base{
protected:
  i64 nelements=0;
  T * val=nullptr;
public:
  base(i64 nelements){
    this->nelements=nelements;
    this->val=(T*)malloc(sizeof(T)*this->nelements);
    for(i64 i=0;i<this->nelements;++i){(*this)(i)=(T)i;}
  }
  virtual ~base(){free(this->val);}
  const T& operator()(i64 i)const{return(this->val[i]);}
  T& operator()(i64 i){return(this->val[i]);}
  const i64& size()const{return(this->nelements);}
};
enum class drop{rows,columns};
template<class T>
class arr1d:public base<T>{
protected:
  i64 d1=0;
public:
  arr1d(i64 d1):base<T>(d1){this->d1=d1;};
  ~arr1d(){};
  template<typename F>
  arr1d& reduction(const arr2d<T> &ii,F f,const drop which);
};
template<class T>
class arr2d:public base<T>{
protected:
  i64 d1=0,d2=0;
public:
  arr2d(i64 d1, i64 d2):base<T>(d1*d2){this->d1=d1;this->d2=d2;};
  ~arr2d(){};
  const T& operator()(i64 i,i64 j)const{return(this->val[j*this->d1+i]);}
  T& operator()(i64 i,i64 j){return(this->val[j*this->d1+i]);}
  const i64& size(i64 i)const{if(i==1){return(d1);}else{return(d2);}}
};
//@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
template<typename T> template<typename F>
arr1d<T>& arr1d<T>::reduction(const arr2d<T> &ii,F f,const drop which){
  switch(which){
  case drop::rows:
    if(this->d1!=ii.size(2)){string msg="err";throw msg;}
    // output index is the (outer) column index i: the inner loop
    // accumulates a whole column into the single cell (*this)(i)
    for(i64 i=0;i<ii.size(2);++i){
      for(i64 j=0;j<ii.size(1);++j){
        f((*this)(i),ii(j,i));
      }
    }
    break;
  case drop::columns:
    if(this->d1!=ii.size(1)){string msg="err";throw msg;}
    // output index is the (inner) row index j: consecutive inner
    // iterations write to consecutive cells of *this
    for(i64 i=0;i<ii.size(2);++i){
      for(i64 j=0;j<ii.size(1);++j){
        f((*this)(j),ii(j,i));
      }
    }
    break;
  }
  return *this;
}
int main(){
  arr2d<double> x(70000,70000);
  arr1d<double> y(70000);
  hdclock t;
  try{
    t.tic();
    y.reduction(x,[](double &a, const double &b){a+=b;},drop::columns);
    cout<<t.toc()<<endl;
    for(i64 i=0;i<y.size();++i){y(i)=0.0;}
    t.tic();
    y.reduction(x,[](double &a, const double &b){a+=b;},drop::rows);
    cout<<t.toc()<<endl;
  }catch(string msg){
    cout<<msg<<endl;return(1);
  }
  return(0);
}

Compiled with clang++ 12.01 or g++ 11.1 with flags -std=c++20 -O3, dropping columns needed 2.2 seconds and dropping rows needed 4.5 seconds (Intel i9-9980HK, 64GB RAM).

Any suggestions/explanations for the performance difference, and possible solutions for speeding up the slower case, are highly appreciated.

Thanks and best regards

g++ -O3 -std=c++20 -fopt-info-vec-all gives some insight: it appears that dropping rows doesn't allow for vectorization, but no reason is provided.

However, clang++ -O3 -std=c++20 -Rpass-analysis=loop-vectorize is more helpful, reporting for the drop::rows call:

remark: loop not vectorized: cannot prove it is safe to reorder floating-point operations; allow reordering by specifying '#pragma clang loop vectorize(enable)' before the loop or by providing the compiler option '-ffast-math'. [-Rpass-analysis=loop-vectorize] y.reduction(x,[](double &a, const double &b){a+=b;},drop::rows);

Indeed, adding -ffast-math to the compiler options flips the result: 2.25 seconds for dropping columns and 1.5 seconds for dropping rows.
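If enabling -ffast-math for the whole translation unit is undesirable, the reordering can be requested for just the offending loop with the pragma that clang's remark suggests. Below is a minimal sketch of the drop::rows branch with the pragma added; whether the pragma alone is enough to permit the floating-point reordering may depend on the compiler version, so this should be verified with -Rpass-analysis=loop-vectorize:

  case drop::rows:
    if(this->d1!=ii.size(2)){string msg="err";throw msg;}
    for(i64 i=0;i<ii.size(2);++i){
      // ask clang to vectorize the reduction over column i even though
      // that reorders the floating-point additions
      #pragma clang loop vectorize(enable)
      for(i64 j=0;j<ii.size(1);++j){
        f((*this)(i),ii(j,i));
      }
    }
    break;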

It's not trivial to find out the exact reasons for this behavior, as they depend on a lot of different factors. My best advice is to look at the assembly. Intel VTune is great for understanding what's going on inside the CPU.

I can speculate about two possible reasons for this difference:

  1. Different vectorization behaviour of the compiler. The compiler might have generated efficient vectorized code for one case and not the other. You should look at the assembly and see if that might be the case.

  2. Long dependency chains. In the rows case, you're summing into one cell at a time, which means each addition depends on the previous one. That can prevent the CPU from executing those additions in parallel (modern Intel CPUs can do something like 4 additions in one clock). One way to break the chain by hand is sketched below this list.
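Regarding point 2, the dependency chain can also be broken manually by keeping several independent partial sums and combining them at the end; because the reordering is then spelled out in the source, -ffast-math is not needed for it. A minimal sketch of the drop::rows branch with four accumulators, assuming the callback is a plain addition as in main (for an arbitrary lambda this split is not generally valid):

  case drop::rows:
    if(this->d1!=ii.size(2)){string msg="err";throw msg;}
    for(i64 i=0;i<ii.size(2);++i){
      T s0=0,s1=0,s2=0,s3=0;            // four independent dependency chains
      i64 j=0;
      for(;j+4<=ii.size(1);j+=4){       // unrolled by four
        s0+=ii(j  ,i);
        s1+=ii(j+1,i);
        s2+=ii(j+2,i);
        s3+=ii(j+3,i);
      }
      for(;j<ii.size(1);++j){s0+=ii(j,i);}   // leftover elements
      (*this)(i)+=s0+s1+s2+s3;
    }
    break;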

Also, have you tried using -march=native?
