Is Eigen slow at multiplying small matrices?

Question

I wrote a function that multiplies Eigen matrices of dimension 10x10 together. Then I wrote a naive multiply function, CustomMultiply, which was surprisingly 2x faster than Eigen's implementation.

I tried a couple of different compilation flags, such as -O2 and -O3, which made no difference.

  #include <vector>

  #include <Eigen/Core>

  constexpr int dimension = 10;
  using Matrix = Eigen::Matrix<double, dimension, dimension>;

  Matrix CustomMultiply(const Matrix& a, const Matrix& b) {
    Matrix result = Matrix::Zero();
    for (int bcol_idx = 0; bcol_idx < dimension; ++bcol_idx) {
      for (int brow_idx = 0; brow_idx < dimension; ++brow_idx) {
        result.col(bcol_idx).noalias() += a.col(brow_idx) * b(brow_idx, bcol_idx);
      }
    }
    return result;
  }

  Matrix PairwiseMultiplyEachMatrixNoAlias(int num_repetitions, const std::vector<Matrix>& input) {
    Matrix acc = Matrix::Zero();
    for (int i = 0; i < num_repetitions; ++i) {
      for (const auto& matrix_a : input) {
        for (const auto& matrix_b : input) {
          acc.noalias() += matrix_a * matrix_b;
        }
      }
    }
    return acc;
  }

  Matrix PairwiseMultiplyEachMatrixCustom(int num_repetitions, const std::vector<Matrix>& input) {
    Matrix acc = Matrix::Zero();
    for (int i = 0; i < num_repetitions; ++i) {
      for (const auto& matrix_a : input) {
        for (const auto& matrix_b : input) {
          acc.noalias() += CustomMultiply(matrix_a, matrix_b);
        }
      }
    }
    return acc;
  }

PairwiseMultiplyEachMatrixNoAlias is 2x slower than PairwiseMultiplyEachMatrixCustom on my machine when I pass in 100 random matrices as input and use 100 as num_repetitions. My machine details: Intel Xeon CPU E5-2630 v4, Ubuntu 16.04, Eigen 3

Updates: Results are unchanged after the following modifications, made after helpful discussion in the comments:

num_repetitions = 1 and input.size() = 1000
using .lazyProduct() and using .eval() actually leads to further slowdown
clang 8.0.0
g++ 9.2
using flags -march=native -DNDEBUG

Update 2:
Following up on @dtell's findings with the Google Benchmark library, I found an interesting result. Multiplying 2 matrices with Eigen is faster than custom, but multiplying many matrices with Eigen is 2x slower, in line with the previous findings.

Here is my Google Benchmark code. (Note: There was an off-by-one in the GetRandomMatrices() function below, which is now fixed.)

#include <cstdlib>
#include <vector>

#include <Eigen/Core>
#include <Eigen/StdVector>
#include <benchmark/benchmark.h>

constexpr int dimension = 10;
constexpr int num_random_matrices = 10;
using Matrix = Eigen::Matrix<double, dimension, dimension>;
using Eigen_std_vector = std::vector<Matrix,Eigen::aligned_allocator<Matrix>>;

Eigen_std_vector GetRandomMatrices(int num_matrices) {
  Eigen_std_vector matrices;
  for (int i = 0; i < num_matrices; ++i) {
    matrices.push_back(Matrix::Random());
  }
  return matrices;
}

Matrix CustomMultiply(const Matrix& a, const Matrix& b) {
  Matrix result = Matrix::Zero();
  for (int bcol_idx = 0; bcol_idx < dimension; ++bcol_idx) {
    for (int brow_idx = 0; brow_idx < dimension; ++brow_idx) {
      result.col(bcol_idx).noalias() += a.col(brow_idx) * b(brow_idx, bcol_idx);
    }
  }
  return result;
}

Matrix PairwiseMultiplyEachMatrixNoAlias(int num_repetitions, const Eigen_std_vector& input) {
  Matrix acc = Matrix::Zero();
  for (int i = 0; i < num_repetitions; ++i) {
    for (const auto& matrix_a : input) {
      for (const auto& matrix_b : input) {
        acc.noalias() += matrix_a * matrix_b;
      }
    }
  }
  return acc;
}

Matrix PairwiseMultiplyEachMatrixCustom(int num_repetitions, const Eigen_std_vector& input) {
  Matrix acc = Matrix::Zero();
  for (int i = 0; i < num_repetitions; ++i) {
    for (const auto& matrix_a : input) {
      for (const auto& matrix_b : input) {
        acc.noalias() += CustomMultiply(matrix_a, matrix_b);
      }
    }
  }
  return acc;
}

void BM_PairwiseMultiplyEachMatrixNoAlias(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(num_random_matrices);
  for (auto _ : state) {
    benchmark::DoNotOptimize(PairwiseMultiplyEachMatrixNoAlias(1, random_matrices));
  }
}
BENCHMARK(BM_PairwiseMultiplyEachMatrixNoAlias);


void BM_PairwiseMultiplyEachMatrixCustom(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(num_random_matrices);
  for (auto _ : state) {
    benchmark::DoNotOptimize(PairwiseMultiplyEachMatrixCustom(1, random_matrices));
  }
}
BENCHMARK(BM_PairwiseMultiplyEachMatrixCustom);

void BM_MultiplySingle(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(2);
  for (auto _ : state) {
    benchmark::DoNotOptimize((random_matrices[0] * random_matrices[1]).eval());
  }
}
BENCHMARK(BM_MultiplySingle);

void BM_MultiplySingleCustom(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(2);
  for (auto _ : state) {
    benchmark::DoNotOptimize(CustomMultiply(random_matrices[0], random_matrices[1]));
  }
}
BENCHMARK(BM_MultiplySingleCustom);



double TestCustom() {
  const Matrix a = Matrix::Random();
  const Matrix b = Matrix::Random();

  const Matrix c = a * b;
  const Matrix custom_c = CustomMultiply(a, b);

  const double err = (c - custom_c).squaredNorm();
  return err;
}

// Just sanity check the multiplication
void BM_TestCustom(benchmark::State& state) {
  if (TestCustom() > 1e-10) {
    exit(-1);
  }
}
BENCHMARK(BM_TestCustom);

This yields the following mysterious report:

Run on (20 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 25600K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_PairwiseMultiplyEachMatrixNoAlias      28283 ns      28285 ns      20250
BM_PairwiseMultiplyEachMatrixCustom       14442 ns      14443 ns      48488
BM_MultiplySingle                           791 ns        791 ns     876969
BM_MultiplySingleCustom                     874 ns        874 ns     802052
BM_TestCustom                                 0 ns          0 ns          0
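
One way to read the report above, as a rough sketch: with num_random_matrices = 10 and one repetition, each iteration of the pairwise benchmarks performs 10 x 10 = 100 products, so the pairwise rows can be divided down to a per-product cost. The helper names below are hypothetical and not part of the benchmark code.

```cpp
#include <cassert>

// Number of matrix products executed by one call to a pairwise benchmark:
// every matrix in the input is multiplied by every matrix, itself included.
constexpr long PairwiseProductCount(long num_repetitions, long num_matrices) {
  return num_repetitions * num_matrices * num_matrices;
}

// Average time attributed to a single product, given the time measured
// for one whole benchmark iteration.
double NanosecondsPerProduct(double iteration_ns, long product_count) {
  assert(product_count > 0);
  return iteration_ns / static_cast<double>(product_count);
}
```

With the figures above, this gives about 283 ns per product for the Eigen version (28283 / 100) and about 144 ns for the custom one (14442 / 100), both well below the 791 ns and 874 ns reported by the single-product benchmarks in the same run.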

My current hypothesis is that the slowdown is attributable to instruction cache misses. It's possible that Eigen's matrix multiply function does bad things to the instruction cache.

VTune output for custom:

VTune output for Eigen:

Maybe someone with more experience with VTune can tell me if I am interpreting this result correctly. The DSB is the decoded instruction cache and MITE has something to do with instruction decoder bandwidth. The Eigen version shows that most instructions miss the DSB (66% miss rate), along with a marked increase in stalls due to MITE bandwidth.

Update 3: After getting reports that the single version of custom was faster, I also reproduced it on my machine. This goes against @dtell's original findings on their machine.

CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 25600K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_PairwiseMultiplyEachMatrixNoAlias      34787 ns      34789 ns      16477
BM_PairwiseMultiplyEachMatrixCustom       17901 ns      17902 ns      37759
BM_MultiplySingle                           349 ns        349 ns    2054295
BM_MultiplySingleCustom                     178 ns        178 ns    4624183
BM_TestCustom                                 0 ns          0 ns          0

I wonder if I had left out an optimization flag in my previous benchmark result. In any case, I think the issue is confirmed: Eigen incurs an overhead when multiplying small matrices. If anyone out there has a machine that does not use a uop cache, I would be interested in seeing if the slowdown is less severe.

Answer 1

(gdb) bt
#0  0x00005555555679e3 in Eigen::internal::gemm_pack_rhs<double, long, Eigen::internal::const_blas_data_mapper<double, long, 0>, 4, 0, false, false>::operator()(double*, Eigen::internal::const_blas_data_mapper<double, long, 0> const&, long, long, long, long) ()
#1  0x0000555555566654 in Eigen::internal::general_matrix_matrix_product<long, double, 0, false, double, 0, false, 0>::run(long, long, long, double const*, long, double const*, long, double*, long, double, Eigen::internal::level3_blocking<double, double>&, Eigen::internal::GemmParallelInfo<long>*) ()
#2  0x0000555555565822 in BM_PairwiseMultiplyEachMatrixNoAlias(benchmark::State&) ()
#3  0x000055555556d571 in benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::Benchmark::Instance const*, unsigned long, int, benchmark::internal::ThreadManager*) ()
#4  0x000055555556b469 in benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*) ()
#5  0x000055555556a450 in main ()

From the stack trace, Eigen's matrix multiplication is using a generic multiply method and looping over a dynamic matrix size. For the custom implementation, clang aggressively vectorizes it and unrolls the loops, so there is much less branching.
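
The contrast described above can be illustrated without Eigen. This is only a sketch under simplifying assumptions: plain column-major arrays stand in for Eigen matrices, and MultiplyFixed and MultiplyDynamic are hypothetical names, not Eigen internals. In the first function the dimension is a template parameter, so all loop bounds are compile-time constants that a compiler may unroll and vectorize. In the second, the same loops run over a size known only at run time, so loop counters and branches stay in the generated code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-size product of two N x N column-major matrices.
// N is a compile-time constant, so the optimizer sees every trip count.
template <std::size_t N>
std::array<double, N * N> MultiplyFixed(const std::array<double, N * N>& a,
                                        const std::array<double, N * N>& b) {
  std::array<double, N * N> c{};  // zero-initialized result
  for (std::size_t j = 0; j < N; ++j) {
    for (std::size_t k = 0; k < N; ++k) {
      const double b_kj = b[k + j * N];
      // Column j of c accumulates column k of a scaled by b(k, j),
      // the same loop order as CustomMultiply in the question.
      for (std::size_t i = 0; i < N; ++i) {
        c[i + j * N] += a[i + k * N] * b_kj;
      }
    }
  }
  return c;
}

// Same arithmetic, but n is a run-time value, so the loops cannot be
// specialized for one particular size at compile time.
std::vector<double> MultiplyDynamic(const std::vector<double>& a,
                                    const std::vector<double>& b,
                                    std::size_t n) {
  assert(a.size() == n * n && b.size() == n * n);
  std::vector<double> c(n * n, 0.0);
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t k = 0; k < n; ++k) {
      const double b_kj = b[k + j * n];
      for (std::size_t i = 0; i < n; ++i) {
        c[i + j * n] += a[i + k * n] * b_kj;
      }
    }
  }
  return c;
}
```

Both functions compute the same result; only the information available to the compiler differs. Whether that accounts for the whole gap measured in the question is not something this sketch can establish.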

Maybe there is some flag or option that makes Eigen generate optimized code for this particular size.

However, if the matrix size is bigger, the Eigen version will perform much better than custom.

Answer 2

I've rewritten your code using a proper benchmark library, namely Google Benchmark, and cannot reproduce your measurements.

My results for -O0, where the second template parameter is the matrix dimension:

Running ./benchmark
Run on (12 X 2900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 262K (x6)
  L3 Unified 12582K (x1)
---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
BM_CustomMultiply<double, 3>        5391 ns       5389 ns     105066
BM_CustomMultiply<double, 4>        9365 ns       9364 ns      73649
BM_CustomMultiply<double, 5>       15349 ns      15349 ns      44008
BM_CustomMultiply<double, 6>       20953 ns      20947 ns      32230
BM_CustomMultiply<double, 7>       33328 ns      33318 ns      21584
BM_CustomMultiply<double, 8>       44237 ns      44230 ns      15500
BM_CustomMultiply<double, 9>       57142 ns      57140 ns      11953
BM_CustomMultiply<double, 10>      69382 ns      69382 ns       9998
BM_EigenMultiply<double, 3>         2335 ns       2335 ns     295458
BM_EigenMultiply<double, 4>         1613 ns       1613 ns     457382
BM_EigenMultiply<double, 5>         4791 ns       4791 ns     142992
BM_EigenMultiply<double, 6>         3471 ns       3469 ns     206002
BM_EigenMultiply<double, 7>         9052 ns       9051 ns      78135
BM_EigenMultiply<double, 8>         8655 ns       8655 ns      81717
BM_EigenMultiply<double, 9>        11446 ns      11399 ns      67001
BM_EigenMultiply<double, 10>       15092 ns      15053 ns      46924

As you can see, the number of iterations Google Benchmark uses is orders of magnitude higher than in your benchmark. Micro-benchmarking is extremely hard, especially when you deal with execution times of a few hundred nanoseconds.
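
To illustrate why the iteration count matters, here is a minimal hand-rolled timing loop, offered as a sketch rather than a replacement for Google Benchmark. AverageNanosecondsPerCall is a hypothetical helper. It uses std::chrono::steady_clock and a volatile sink so the compiler cannot discard the work being timed. With few iterations, the clock reads and loop overhead are a noticeable share of the total for calls lasting only a few hundred nanoseconds.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Runs `work` back to back `iterations` times and returns the average
// wall-clock time per call in nanoseconds. `work` must return a double.
template <typename Work>
double AverageNanosecondsPerCall(Work work, std::int64_t iterations) {
  assert(iterations > 0);
  // Writing each result to a volatile variable keeps the calls from
  // being optimized away as dead code.
  volatile double sink = 0.0;
  const auto start = std::chrono::steady_clock::now();
  for (std::int64_t i = 0; i < iterations; ++i) {
    sink = sink + work();
  }
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

A call such as AverageNanosecondsPerCall([&] { return (a * b).sum(); }, 1000000) would time one product per iteration, assuming a and b are matrices already in scope. The sum is only there to produce a value for the sink and adds its own small cost.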

To be fair, calling your custom function involves a copy, and manually inlining it saves a few nanoseconds, but it still does not beat Eigen.

Measurement with manually inlined CustomMultiply and -O2 -DNDEBUG -march=native:

Running ./benchmark
Run on (12 X 2900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 262K (x6)
  L3 Unified 12582K (x1)
---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
BM_CustomMultiply<double, 3>          51 ns         51 ns   11108114
BM_CustomMultiply<double, 4>          88 ns         88 ns    7683611
BM_CustomMultiply<double, 5>         147 ns        147 ns    4642341
BM_CustomMultiply<double, 6>         213 ns        213 ns    3205627
BM_CustomMultiply<double, 7>         308 ns        308 ns    2246391
BM_CustomMultiply<double, 8>         365 ns        365 ns    1904860
BM_CustomMultiply<double, 9>         556 ns        556 ns    1254953
BM_CustomMultiply<double, 10>        661 ns        661 ns    1027825
BM_EigenMultiply<double, 3>           39 ns         39 ns   17918807
BM_EigenMultiply<double, 4>           69 ns         69 ns    9931755
BM_EigenMultiply<double, 5>          119 ns        119 ns    5801185
BM_EigenMultiply<double, 6>          178 ns        178 ns    3838772
BM_EigenMultiply<double, 7>          256 ns        256 ns    2692898
BM_EigenMultiply<double, 8>          385 ns        385 ns    1826598
BM_EigenMultiply<double, 9>          546 ns        546 ns    1271687
BM_EigenMultiply<double, 10>         644 ns        644 ns    1104798

Answer 1
3 2019-09-25 21:53:28

Answer 2
1 2019-09-24 20:53:17
