Speeding up Eigen C++ transpose?

Question

I know that these "Eigen speed-up" questions come up regularly, but after reading many of them and trying several flags I cannot get a better time with Eigen in C++ than with the traditional way of performing a transpose. In fact, using blocking is much more efficient. The following is the code:

#include <cstdio>
#include <ctime>
#include <cstdlib>
#include <iostream>
#include <Eigen/Dense>

#define min( a, b ) ( ((a) < (b)) ? (a) : (b) )

int main(){
    const int n = 10000;
    const int csize = 32;
    float **a, **b;
    clock_t cputime1, cputime2;
    int i,j,k,ii,jj,kk;
  
    // Allocating memory for array/matrix
    a = new float * [n];
    for (i=0; i<n; i++){
        a[i] = new float [n];
    }
    b = new float * [n];
    for (i=0; i<n; i++){
        b[i] = new float[n];
    }
    // eigen matrices
    Eigen::MatrixXf M1 = Eigen::MatrixXf::Constant(n, n, 0.0);
    Eigen::MatrixXf M2 = Eigen::MatrixXf::Constant(n, n, 0.0);
    
    // Filling matrices with zeros
    for(i=0; i<n; ++i)
        for (j=0; j<n; ++j)
            a[i][j] = 0;
    for(i=0; i<n; ++i)
        for (j=0; j<n; ++j)
            b[i][j] = 0;

    // Direct (inefficient) transposition
    cputime1 = clock();
    for (i=0; i<n; ++i)
        for (j=0; j<n; ++j)
            a[i][j] = b[j][i];
    cputime2 = clock() - cputime1;
    std::printf("Time for transposition: %f\n", ((double)cputime2)/CLOCKS_PER_SEC);

    // Transposition using cache-blocking
    cputime1 = clock();
    for (ii=0; ii<n; ii+=csize)
        for (jj=0; jj<n; jj+=csize)
            for (i=ii; i<min(n,ii+csize); ++i)
                for (j=jj; j<min(n,jj+csize); ++j)
                    a[i][j] = b[j][i];
    cputime2 = clock() - cputime1;
    std::printf("Time for transposition: %f\n", ((double)cputime2)/CLOCKS_PER_SEC);

    // eigen
    cputime1 = clock();
    M1.noalias() = M2.transpose();
    cputime2 = clock() - cputime1;
    std::printf("Time for transposition with eigen: %f\n", ((double)cputime2)/CLOCKS_PER_SEC);

    // use data
    std::cout << a[n/2][n/2] << std::endl;
    std::cout << b[n/2][n/2] << std::endl;
    std::cout << M1(n/2,n/2) << std::endl;

    return 0;
}

The compile command I am using is

g++ -fno-math-errno -ffast-math -march=native -fopenmp -O2 -msse2 -DNDEBUG  blocking_and_eigen.cpp

with results

Time for transposition: 1.926674
Time for transposition: 0.280653
Time for transposition with eigen: 2.018217

I am using Eigen 3.4.0 and g++ 11.2.0.

Do you have any suggestions for improving Eigen's performance? Thanks in advance.

Answer 1

As INS suggested in the comments, the actual copying of the matrix is what causes the performance drop. I slightly modified your example to use some numbers instead of all zeros (to avoid any kind of optimisation):

for(i=0; i<n; ++i) {
    for (j=0; j<n; ++j) {
        a[i][j] = i+j;
        M1(i,j) = i+j;
    }
}
for(i=0; i<n; ++i) {
    for (j=0; j<n; ++j) {
        b[i][j] = i+j;
        M1(i,j) = i+j;
    }
}

I also replaced the final print statements with a full check over the result (when the transpose is not done in place, the check is performed against M2):

for (i=0; i<n; ++i)
    for (j=0; j<n; ++j)
        if (a[i][j] != M1(i,j))
            std::cout << "Diff here! " << std::endl;
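One caveat with this check: filling with i+j produces a symmetric matrix, so its transpose is identical to the original and the comparison would pass even if no transposition took place. Below is a minimal sketch of a stricter check, assuming flat row-major buffers rather than the arrays used above. `make_asymmetric` and `is_transpose_of` are hypothetical helper names, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: fill an n x n row-major buffer with m(i, j) = i - j.
// The values are antisymmetric, so m(i, j) != m(j, i) whenever i != j, and
// for sizes like n = 10000 they are small enough to be exact in a float.
std::vector<float> make_asymmetric(std::size_t n) {
    std::vector<float> m(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            m[i * n + j] = static_cast<float>(i) - static_cast<float>(j);
    return m;
}

// Returns true only if dst(i, j) == src(j, i) for every element.
bool is_transpose_of(const std::vector<float>& dst,
                     const std::vector<float>& src, std::size_t n) {
    assert(dst.size() == n * n && src.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            if (dst[i * n + j] != src[j * n + i])
                return false;
    return true;
}
```

With this fill, comparing a matrix against itself fails the check, which is what makes a skipped or partial transposition visible.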

Then I tried several tests:

1. Preallocation and assignment
   Eigen::MatrixXf M2 = Eigen::MatrixXf::Constant(n, n, 0.0); ... some code here ... M2 = M1.transpose();
2. Copy constructor
   Eigen::MatrixXf M2(M1.transpose());
3. In place
   M1.transposeInPlace();
4. Copy construction using auto and C++17
   auto M2{ M1.transpose() };

The last case is the most puzzling: its performance is outstanding. I think there are two parts to the story. First, if I print the typeid name of M2 for cases 2 and 4, they are different. The names are mangled, but they give us a clue:

N5Eigen6MatrixIfLin1ELin1ELi0ELin1ELin1EEE
N5Eigen9TransposeINS_6MatrixIfLin1ELin1ELi0ELin1ELin1EEEEE

The auto keyword resolves to a different type, one specific to the transposed matrix. The second part of the story is that M1 is not modified afterwards, so either the compiler moves it or, most likely, the Eigen::Transpose expression ( https://eigen.tuxfamily.org/dox/classEigen_1_1Transpose.html ) only keeps a reference to the original matrix and does not copy it.
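To illustrate why case 4 can take almost no time, here is a toy sketch of a lazy transpose view. `Matrix`, `TransposeView` and `evaluate` are hypothetical types and names written for this example. They are not Eigen's implementation, only a model of the "keeps a reference, copies nothing" behaviour described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy row-major matrix, for illustration only.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& operator()(std::size_t i, std::size_t j) {
        assert(i < rows && j < cols);
        return data[i * cols + j];
    }
    float operator()(std::size_t i, std::size_t j) const {
        assert(i < rows && j < cols);
        return data[i * cols + j];
    }
};

// Hypothetical lazy view: it stores only a reference, so building it costs
// the same no matter how large the matrix is. Element access swaps indices.
struct TransposeView {
    const Matrix& src;
    float operator()(std::size_t i, std::size_t j) const { return src(j, i); }
};

// Materialising the view is where the element-by-element copy happens.
Matrix evaluate(const TransposeView& v) {
    Matrix out(v.src.cols, v.src.rows);
    for (std::size_t i = 0; i < out.rows; ++i)
        for (std::size_t j = 0; j < out.cols; ++j)
            out(i, j) = v(i, j);
    return out;
}
```

In this model, a later write to the source matrix is visible through the view but not through the evaluated copy. If the same holds for Eigen, timing `auto M2{ M1.transpose() };` measures only the creation of the expression object. To time a real copy, give the variable a concrete matrix type as in case 2, or call `.eval()` on the expression.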

Results

Test	Direct (s)	Cache block (s)	Eigen (s)
1	2.633	0.312	1.861
2	2.599	0.262	1.968
3	2.602	0.262	0.216
4	2.552	0.280	0.000002
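
For reference, the technique behind the "Cache block" column can be sketched on a single contiguous buffer instead of an array of row pointers. This is a minimal sketch under stated assumptions: `transpose_blocked` and the flat row-major layout are choices made for this example, and the default tile size of 32 is simply the csize value from the question. The tile bounds are min(n, ii + block) and min(n, jj + block), so every element of every tile is visited.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: dst = transpose(src) for n x n row-major buffers, processed in
// block x block tiles so that reads and writes both stay within a small
// working set that fits in cache.
void transpose_blocked(const std::vector<float>& src, std::vector<float>& dst,
                       std::size_t n, std::size_t block = 32) {
    assert(src.size() == n * n && dst.size() == n * n && block > 0);
    for (std::size_t ii = 0; ii < n; ii += block) {
        for (std::size_t jj = 0; jj < n; jj += block) {
            // Clamp the tile to the matrix edge when n is not a multiple of block.
            const std::size_t i_end = std::min(n, ii + block);
            const std::size_t j_end = std::min(n, jj + block);
            for (std::size_t i = ii; i < i_end; ++i)
                for (std::size_t j = jj; j < j_end; ++j)
                    dst[i * n + j] = src[j * n + i];
        }
    }
}
```

The best tile size depends on the cache hierarchy of the machine, so 32 is a starting point to measure against rather than a recommended value.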
