Is Eigen slow at multiplying small matrices?
I wrote a function that multiplies Eigen matrices of dimension 10x10 together. Then I wrote a naive multiply function, CustomMultiply, which was surprisingly 2x faster than Eigen's implementation.
I tried a couple of different compilation flags like -O2 and -O3, which did not make a difference.
#include <Eigen/Core>
#include <vector>

constexpr int dimension = 10;
using Matrix = Eigen::Matrix<double, dimension, dimension>;

// Naive product: accumulates scaled columns of a into each column of the result.
Matrix CustomMultiply(const Matrix& a, const Matrix& b) {
  Matrix result = Matrix::Zero();
  for (int bcol_idx = 0; bcol_idx < dimension; ++bcol_idx) {
    for (int brow_idx = 0; brow_idx < dimension; ++brow_idx) {
      result.col(bcol_idx).noalias() += a.col(brow_idx) * b(brow_idx, bcol_idx);
    }
  }
  return result;
}

// Sums the products of every ordered pair of input matrices using Eigen's operator*.
Matrix PairwiseMultiplyEachMatrixNoAlias(int num_repetitions, const std::vector<Matrix>& input) {
  Matrix acc = Matrix::Zero();
  for (int i = 0; i < num_repetitions; ++i) {
    for (const auto& matrix_a : input) {
      for (const auto& matrix_b : input) {
        acc.noalias() += matrix_a * matrix_b;
      }
    }
  }
  return acc;
}

// Same accumulation, but each product is computed by CustomMultiply.
Matrix PairwiseMultiplyEachMatrixCustom(int num_repetitions, const std::vector<Matrix>& input) {
  Matrix acc = Matrix::Zero();
  for (int i = 0; i < num_repetitions; ++i) {
    for (const auto& matrix_a : input) {
      for (const auto& matrix_b : input) {
        acc.noalias() += CustomMultiply(matrix_a, matrix_b);
      }
    }
  }
  return acc;
}
PairwiseMultiplyEachMatrixNoAlias is 2x slower than PairwiseMultiplyEachMatrixCustom on my machine when I pass in 100 random matrices as input and use 100 as num_repetitions.
My machine details: Intel Xeon CPU E5-2630 v4, Ubuntu 16.04, Eigen 3
Updates: Results are unchanged after the following modifications, made after helpful discussion in the comments:
- num_repetitions = 1 and input.size() = 1000
- Using .lazyProduct() and using .eval() actually leads to further slowdown
- -march=native -DNDEBUG
Updates 2:
Following up on @dtell's findings with the Google Benchmark library, I found an interesting result. Multiplying 2 matrices with Eigen is faster than custom, but multiplying many matrices with Eigen is 2x slower, in line with the previous findings.
Here is my Google Benchmark code. (Note: There was an off-by-one error in the GetRandomMatrices() function below, which is now fixed.)
#include <Eigen/Core>
#include <Eigen/StdVector>
#include <benchmark/benchmark.h>
#include <cstdlib>
#include <vector>

constexpr int dimension = 10;
constexpr int num_random_matrices = 10;
using Matrix = Eigen::Matrix<double, dimension, dimension>;
using Eigen_std_vector = std::vector<Matrix,Eigen::aligned_allocator<Matrix>>;

Eigen_std_vector GetRandomMatrices(int num_matrices) {
  Eigen_std_vector matrices;
  for (int i = 0; i < num_matrices; ++i) {
    matrices.push_back(Matrix::Random());
  }
  return matrices;
}

Matrix CustomMultiply(const Matrix& a, const Matrix& b) {
  Matrix result = Matrix::Zero();
  for (int bcol_idx = 0; bcol_idx < dimension; ++bcol_idx) {
    for (int brow_idx = 0; brow_idx < dimension; ++brow_idx) {
      result.col(bcol_idx).noalias() += a.col(brow_idx) * b(brow_idx, bcol_idx);
    }
  }
  return result;
}

Matrix PairwiseMultiplyEachMatrixNoAlias(int num_repetitions, const Eigen_std_vector& input) {
  Matrix acc = Matrix::Zero();
  for (int i = 0; i < num_repetitions; ++i) {
    for (const auto& matrix_a : input) {
      for (const auto& matrix_b : input) {
        acc.noalias() += matrix_a * matrix_b;
      }
    }
  }
  return acc;
}

Matrix PairwiseMultiplyEachMatrixCustom(int num_repetitions, const Eigen_std_vector& input) {
  Matrix acc = Matrix::Zero();
  for (int i = 0; i < num_repetitions; ++i) {
    for (const auto& matrix_a : input) {
      for (const auto& matrix_b : input) {
        acc.noalias() += CustomMultiply(matrix_a, matrix_b);
      }
    }
  }
  return acc;
}

void BM_PairwiseMultiplyEachMatrixNoAlias(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(num_random_matrices);
  for (auto _ : state) {
    benchmark::DoNotOptimize(PairwiseMultiplyEachMatrixNoAlias(1, random_matrices));
  }
}
BENCHMARK(BM_PairwiseMultiplyEachMatrixNoAlias);

void BM_PairwiseMultiplyEachMatrixCustom(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(num_random_matrices);
  for (auto _ : state) {
    benchmark::DoNotOptimize(PairwiseMultiplyEachMatrixCustom(1, random_matrices));
  }
}
BENCHMARK(BM_PairwiseMultiplyEachMatrixCustom);

void BM_MultiplySingle(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(2);
  for (auto _ : state) {
    benchmark::DoNotOptimize((random_matrices[0] * random_matrices[1]).eval());
  }
}
BENCHMARK(BM_MultiplySingle);

void BM_MultiplySingleCustom(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(2);
  for (auto _ : state) {
    benchmark::DoNotOptimize(CustomMultiply(random_matrices[0], random_matrices[1]));
  }
}
BENCHMARK(BM_MultiplySingleCustom);

double TestCustom() {
  const Matrix a = Matrix::Random();
  const Matrix b = Matrix::Random();
  const Matrix c = a * b;
  const Matrix custom_c = CustomMultiply(a, b);
  const double err = (c - custom_c).squaredNorm();
  return err;
}

// Just sanity check the multiplication
void BM_TestCustom(benchmark::State& state) {
  if (TestCustom() > 1e-10) {
    exit(-1);
  }
}
BENCHMARK(BM_TestCustom);
This yields the following mysterious report:
Run on (20 X 3100 MHz CPU s)
CPU Caches:
L1 Data 32K (x10)
L1 Instruction 32K (x10)
L2 Unified 256K (x10)
L3 Unified 25600K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
BM_PairwiseMultiplyEachMatrixNoAlias 28283 ns 28285 ns 20250
BM_PairwiseMultiplyEachMatrixCustom 14442 ns 14443 ns 48488
BM_MultiplySingle 791 ns 791 ns 876969
BM_MultiplySingleCustom 874 ns 874 ns 802052
BM_TestCustom 0 ns 0 ns 0
My current hypothesis is that the slowdown is attributable to instruction cache misses. It's possible Eigen's matrix multiply function does bad things to the instruction cache.
VTune output for custom: [screenshot]
VTune output for Eigen: [screenshot]
Maybe someone with more experience with VTune can tell me if I am interpreting this result correctly. The DSB is the decoded instruction (uop) cache, and MITE is the legacy instruction decode path, so it relates to instruction decoder bandwidth. The Eigen version shows that most instructions miss the DSB (66% miss rate), along with a marked increase in stalls due to MITE bandwidth.
Update 3: After getting reports that the single version of custom was faster, I also reproduced it on my machine. This goes against @dtell's original findings on their machine.
CPU Caches:
L1 Data 32K (x10)
L1 Instruction 32K (x10)
L2 Unified 256K (x10)
L3 Unified 25600K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------
BM_PairwiseMultiplyEachMatrixNoAlias 34787 ns 34789 ns 16477
BM_PairwiseMultiplyEachMatrixCustom 17901 ns 17902 ns 37759
BM_MultiplySingle 349 ns 349 ns 2054295
BM_MultiplySingleCustom 178 ns 178 ns 4624183
BM_TestCustom 0 ns 0 ns 0
I wonder if I had left out an optimization flag in my previous benchmark result. In any case, I think the issue is confirmed: Eigen incurs an overhead when multiplying small matrices. If anyone out there has a machine that does not use a uop cache, I would be interested in seeing if the slowdown is less severe.
(gdb) bt
#0 0x00005555555679e3 in Eigen::internal::gemm_pack_rhs<double, long, Eigen::internal::const_blas_data_mapper<double, long, 0>, 4, 0, false, false>::operator()(double*, Eigen::internal::const_blas_data_mapper<double, long, 0> const&, long, long, long, long) ()
#1 0x0000555555566654 in Eigen::internal::general_matrix_matrix_product<long, double, 0, false, double, 0, false, 0>::run(long, long, long, double const*, long, double const*, long, double*, long, double, Eigen::internal::level3_blocking<double, double>&, Eigen::internal::GemmParallelInfo<long>*) ()
#2 0x0000555555565822 in BM_PairwiseMultiplyEachMatrixNoAlias(benchmark::State&) ()
#3 0x000055555556d571 in benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::Benchmark::Instance const*, unsigned long, int, benchmark::internal::ThreadManager*) ()
#4 0x000055555556b469 in benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*) ()
#5 0x000055555556a450 in main ()
From the stack trace, Eigen's matrix multiplication is using a generic multiply method and loops over a dynamic matrix size. For the custom implementation, clang aggressively vectorizes it and unrolls the loops, so there is much less branching.
Maybe there is some flag or option that makes Eigen generate code optimized for this particular size.
However, if the matrix size is bigger, the Eigen version will perform much better than custom.
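To make the compile-time versus run-time size point concrete, here is a minimal sketch that does not use Eigen at all. FixedMatrix, MultiplyFixed, and MultiplyDynamic are hypothetical names made up for this illustration. The dynamic version only mimics the idea of run-time loop bounds; it is far simpler than the generic kernel in the backtrace, which also packs and blocks its operands.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type for illustration only (not an Eigen type).
// Storage is column-major: element (row, col) lives at index row + col * N.
template <std::size_t N>
using FixedMatrix = std::array<double, N * N>;

// Loop bounds are compile-time constants, so the optimizer is free to
// unroll and vectorize the loops for each N.
// N cannot be deduced from the arguments, so call it as MultiplyFixed<10>(a, b).
template <std::size_t N>
FixedMatrix<N> MultiplyFixed(const FixedMatrix<N>& a, const FixedMatrix<N>& b) {
  FixedMatrix<N> result{};  // value-initialized, so all zeros
  for (std::size_t col = 0; col < N; ++col) {
    for (std::size_t k = 0; k < N; ++k) {
      const double b_kc = b[k + col * N];
      for (std::size_t row = 0; row < N; ++row) {
        result[row + col * N] += a[row + k * N] * b_kc;
      }
    }
  }
  return result;
}

// Same arithmetic, but n is only known at run time, so the compiler has to
// keep real loops with bounds checks on every trip.
std::vector<double> MultiplyDynamic(const std::vector<double>& a,
                                    const std::vector<double>& b,
                                    std::size_t n) {
  assert(a.size() == n * n && b.size() == n * n);
  std::vector<double> result(n * n, 0.0);
  for (std::size_t col = 0; col < n; ++col) {
    for (std::size_t k = 0; k < n; ++k) {
      const double b_kc = b[k + col * n];
      for (std::size_t row = 0; row < n; ++row) {
        result[row + col * n] += a[row + k * n] * b_kc;
      }
    }
  }
  return result;
}
```

Comparing the generated assembly of MultiplyFixed<10> and MultiplyDynamic at -O2 or -O3 is one way to see how much unrolling the fixed bounds allow; whether that explains the whole 2x gap on a given machine is something only a profile can confirm.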
I've rewritten your code using a proper benchmark library, namely Google Benchmark, and cannot reproduce your measurements.
My results for -O0, where the second template parameter is the matrix dimension:
Running ./benchmark
Run on (12 X 2900 MHz CPU s)
CPU Caches:
L1 Data 32K (x6)
L1 Instruction 32K (x6)
L2 Unified 262K (x6)
L3 Unified 12582K (x1)
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_CustomMultiply<double, 3> 5391 ns 5389 ns 105066
BM_CustomMultiply<double, 4> 9365 ns 9364 ns 73649
BM_CustomMultiply<double, 5> 15349 ns 15349 ns 44008
BM_CustomMultiply<double, 6> 20953 ns 20947 ns 32230
BM_CustomMultiply<double, 7> 33328 ns 33318 ns 21584
BM_CustomMultiply<double, 8> 44237 ns 44230 ns 15500
BM_CustomMultiply<double, 9> 57142 ns 57140 ns 11953
BM_CustomMultiply<double, 10> 69382 ns 69382 ns 9998
BM_EigenMultiply<double, 3> 2335 ns 2335 ns 295458
BM_EigenMultiply<double, 4> 1613 ns 1613 ns 457382
BM_EigenMultiply<double, 5> 4791 ns 4791 ns 142992
BM_EigenMultiply<double, 6> 3471 ns 3469 ns 206002
BM_EigenMultiply<double, 7> 9052 ns 9051 ns 78135
BM_EigenMultiply<double, 8> 8655 ns 8655 ns 81717
BM_EigenMultiply<double, 9> 11446 ns 11399 ns 67001
BM_EigenMultiply<double, 10> 15092 ns 15053 ns 46924
As you can see, the number of iterations Google Benchmark uses is orders of magnitude higher than in your benchmark. Micro-benchmarking is extremely hard, especially when you deal with execution times of a few hundred nanoseconds.
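For readers without Google Benchmark at hand, the idea of averaging over many iterations can be sketched with std::chrono. This is a rough illustration only: AverageNanoseconds is a hypothetical helper, and the volatile sink is a crude stand-in for benchmark::DoNotOptimize that does not stop every optimization (the compiler may still hoist a call whose inputs never change).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs fn() `iterations` times and returns the mean
// wall-clock time per call in nanoseconds. fn must return a double.
template <typename Fn>
double AverageNanoseconds(Fn&& fn, std::size_t iterations) {
  assert(iterations > 0);
  // Writing every result to a volatile keeps the calls observable, so the
  // loop is not removed as dead code.
  volatile double sink = 0.0;
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    sink = sink + fn();
  }
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

A call such as AverageNanoseconds([&] { return (a * b).eval().sum(); }, 1000000) would time one Eigen product per iteration, assuming a and b are matrices in scope. With only a few thousand iterations, timer resolution and CPU frequency scaling can easily dominate a measurement of a few hundred nanoseconds.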
To be fair, calling your custom function involves a copy, and manually inlining it saves a few nanoseconds, but it still does not beat Eigen.
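What "manually inlining" avoids can be sketched without Eigen: instead of returning a temporary that is then added to the accumulator, the product is accumulated straight into the output. ColMajor and MultiplyAccumulate are hypothetical names for this illustration, and this is an assumption about the shape of the inlined code, not the exact code that was measured.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical column-major N x N matrix type, for illustration only.
template <std::size_t N>
using ColMajor = std::array<double, N * N>;

// Computes acc += a * b in place, so no temporary result matrix is created
// and copied. acc must not be the same object as a or b.
// N cannot be deduced, so call it as MultiplyAccumulate<10>(a, b, acc).
template <std::size_t N>
void MultiplyAccumulate(const ColMajor<N>& a, const ColMajor<N>& b,
                        ColMajor<N>& acc) {
  assert(&acc != &a && &acc != &b);
  for (std::size_t col = 0; col < N; ++col) {
    for (std::size_t k = 0; k < N; ++k) {
      const double b_kc = b[k + col * N];
      for (std::size_t row = 0; row < N; ++row) {
        acc[row + col * N] += a[row + k * N] * b_kc;
      }
    }
  }
}
```

In the Eigen version the equivalent change is to move the two loops of CustomMultiply into the pairwise loop and write into acc.col(bcol_idx) directly.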
Measurement with manually inlined CustomMultiply and -O2 -DNDEBUG -march=native:
Running ./benchmark
Run on (12 X 2900 MHz CPU s)
CPU Caches:
L1 Data 32K (x6)
L1 Instruction 32K (x6)
L2 Unified 262K (x6)
L3 Unified 12582K (x1)
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_CustomMultiply<double, 3> 51 ns 51 ns 11108114
BM_CustomMultiply<double, 4> 88 ns 88 ns 7683611
BM_CustomMultiply<double, 5> 147 ns 147 ns 4642341
BM_CustomMultiply<double, 6> 213 ns 213 ns 3205627
BM_CustomMultiply<double, 7> 308 ns 308 ns 2246391
BM_CustomMultiply<double, 8> 365 ns 365 ns 1904860
BM_CustomMultiply<double, 9> 556 ns 556 ns 1254953
BM_CustomMultiply<double, 10> 661 ns 661 ns 1027825
BM_EigenMultiply<double, 3> 39 ns 39 ns 17918807
BM_EigenMultiply<double, 4> 69 ns 69 ns 9931755
BM_EigenMultiply<double, 5> 119 ns 119 ns 5801185
BM_EigenMultiply<double, 6> 178 ns 178 ns 3838772
BM_EigenMultiply<double, 7> 256 ns 256 ns 2692898
BM_EigenMultiply<double, 8> 385 ns 385 ns 1826598
BM_EigenMultiply<double, 9> 546 ns 546 ns 1271687
BM_EigenMultiply<double, 10> 644 ns 644 ns 1104798