Eigen C++: Performance of sparse matrix operations

Question

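Can anyone explain the following behavior of Eigen sparse matrices? I have been reading up on aliasing and lazy evaluation, but I cannot seem to improve on this issue. Tech specs: I am using the latest stable Eigen release on Ubuntu 16.10 with the g++ compiler and no optimization flags.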

Suppose I define a simple identity matrix in the following way:

SparseMatrix<double> spIdent(N,N);
spIdent.reserve(N);
spIdent.setIdentity();

and then perform these operations with it:

spIdent-spIdent;
spIdent*spIdent;
spIdent - spIdent*spIdent;

and measure the computation time of all three. This is what I get:

0 Computation time: 2.6e-05
1 Computation time: 2e-06 
2 Computation time: 1.10706

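This means that either operation on its own is fast, but the combination is very slow. The noalias() method is only defined for dense matrices, and in my dense example it did not make much of a difference. Any insights?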

MCVE：

#include <iostream>
#include <ctime>
#include "../Eigen/Sparse"

using namespace std;
using namespace Eigen;

int main() {

unsigned int N=2000000;

SparseMatrix<double> spIdent(N,N);
spIdent.reserve(N);
spIdent.setIdentity();

clock_t start=clock();
spIdent*spIdent;
cout << "0 Computation time: " << float(clock() - start)/1e6 << '\n';

start=clock();
spIdent-spIdent;
cout << "1 Computation time: " << float(clock() - start)/1e6 << '\n';

start=clock();
spIdent - (spIdent*spIdent);
cout << "2 Computation time: " << float(clock() - start)/1e6 << '\n';

return 0;

}

Answer 1

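It is not so much being optimized away as that lazy evaluation is, well, very lazy. Look at the product. The code that gets called is (at least in whichever version of Eigen is installed on this machine):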

template<typename Derived>
template<typename OtherDerived>
inline const typename SparseSparseProductReturnType<Derived,OtherDerived>::Type
SparseMatrixBase<Derived>::operator*(const SparseMatrixBase<OtherDerived> &other) const
{
  return typename SparseSparseProductReturnType<Derived,OtherDerived>::Type(derived(), other.derived());
}

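It returns an expression of the product (i.e. it is lazy). Nothing is done with this expression, so the cost is zero. The same goes for the difference. Now, when doing a-a*a, a*a is an expression. It then meets operator-, which sees an expression on its right-hand side. That expression is then evaluated into a temporary (i.e. it costs time) so that it can be used in operator-. Why evaluate into a temporary? See Eigen's documentation on lazy evaluation and aliasing for the reasoning (search for "the second circumstance").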
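To make the "building an expression costs nothing" point concrete, here is a minimal sketch of the idea using toy types. `Diag`, `LazyProduct`, and the counter `g_multiplications` are hypothetical names invented for this illustration and are not part of Eigen; the sketch only assumes that a product operator may return a lightweight object instead of a result.

```cpp
#include <cstddef>
#include <vector>

// Counts scalar multiplications that were actually carried out.
static std::size_t g_multiplications = 0;

// Toy stand-in for a matrix: only the diagonal is stored.
struct Diag {
    std::vector<double> d;
};

// Lazy product node. It only remembers its operands by reference,
// so the operands must outlive the expression object.
struct LazyProduct {
    const Diag& lhs;
    const Diag& rhs;

    // The arithmetic happens only when a concrete result is requested.
    Diag eval() const {
        Diag out;
        out.d.resize(lhs.d.size());
        for (std::size_t i = 0; i < lhs.d.size(); ++i) {
            out.d[i] = lhs.d[i] * rhs.d[i];
            ++g_multiplications;
        }
        return out;
    }
};

// operator* builds the expression object and performs no multiplication.
inline LazyProduct operator*(const Diag& a, const Diag& b) {
    return LazyProduct{a, b};
}
```

With these toy types, writing `a * a` as a statement of its own leaves the counter untouched; only a call to `eval()` runs the loop. That mirrors why the stand-alone product and difference in the question appear to take almost no time.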

operator- is a CwiseBinaryOp with the product expression as its right-hand side. The first thing CwiseBinaryOp does is assign the right-hand side to a member:

EIGEN_STRONG_INLINE CwiseBinaryOp(const Lhs& aLhs, const Rhs& aRhs, const BinaryOp& func = BinaryOp())
      : m_lhs(aLhs), m_rhs(aRhs), m_functor(func)

(m_rhs(aRhs)), which in turn calls the SparseMatrix constructor:

/** Constructs a sparse matrix from the sparse expression \a other */
template<typename OtherDerived>
inline SparseMatrix(const SparseMatrixBase<OtherDerived>& other)
  : m_outerSize(0), m_innerSize(0), m_outerIndex(0), m_innerNonZeros(0)
{
  ...
  *this = other.derived();
}

which in turn calls operator=, which (someone correct me if I am wrong) always triggers an evaluation, in this case into a temporary.
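The chain described above (a nested product being evaluated into a temporary while the difference expression is constructed) can be imitated with toy types. This is only a sketch under that assumption: `Vec`, `Product`, `DiffWithProduct`, and `g_products_evaluated` are hypothetical names for illustration, and Eigen's real classes are considerably more involved.

```cpp
#include <cstddef>
#include <vector>

// Counts how many times a product was fully evaluated.
static std::size_t g_products_evaluated = 0;

// Toy stand-in for a matrix (element-wise semantics only).
struct Vec {
    std::vector<double> v;
};

// Lazy product: holds references, does no work until eval() is called.
struct Product {
    const Vec& lhs;
    const Vec& rhs;

    Vec eval() const {
        Vec out;
        out.v.resize(lhs.v.size());
        for (std::size_t i = 0; i < lhs.v.size(); ++i) {
            out.v[i] = lhs.v[i] * rhs.v[i];
        }
        ++g_products_evaluated;
        return out;
    }
};

// Difference node whose right-hand side is a product. The product is
// stored BY VALUE, so it is evaluated in the constructor: building the
// difference expression already pays for the whole product.
struct DiffWithProduct {
    const Vec& lhs;
    Vec rhs;  // evaluated temporary

    DiffWithProduct(const Vec& l, const Product& p) : lhs(l), rhs(p.eval()) {}

    Vec eval() const {
        Vec out;
        out.v.resize(lhs.v.size());
        for (std::size_t i = 0; i < lhs.v.size(); ++i) {
            out.v[i] = lhs.v[i] - rhs.v[i];
        }
        return out;
    }
};

inline Product operator*(const Vec& a, const Vec& b) {
    return Product{a, b};
}

inline DiffWithProduct operator-(const Vec& a, const Product& p) {
    return DiffWithProduct(a, p);
}
```

In this sketch, `a * a` alone never increments the counter, while `a - a * a` increments it once even if the difference itself is never evaluated. That is the same asymmetry the question measured.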

Answer 2

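Well, as others have mentioned, in the first two statements the code is optimized away completely (I tested with a current version of g++ and -O3 set). The disassembly shows this for the second statement: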

  400e78:   e8 03 fe ff ff          callq  400c80 <clock@plt>   # timing begins
  400e7d:   48 89 c5                mov    %rax,%rbp
  400e80:   e8 fb fd ff ff          callq  400c80 <clock@plt>   # timing ends

For the third part, something actually happens, and Eigen library code is called:

  400ede:   e8 9d fd ff ff          callq  400c80 <clock@plt>   # timing begins
  400ee3:   48 89 c5                mov    %rax,%rbp
  400ee6:   8b 44 24 58             mov    0x58(%rsp),%eax
  400eea:   39 44 24 54             cmp    %eax,0x54(%rsp)
  400eee:   c6 44 24 20 00          movb   $0x0,0x20(%rsp)
  400ef3:   48 89 5c 24 28          mov    %rbx,0x28(%rsp)
  400ef8:   48 89 5c 24 30          mov    %rbx,0x30(%rsp)
  400efd:   48 c7 44 24 38 00 00    movq   $0x0,0x38(%rsp)
  400f04:   00 00 
  400f06:   c6 44 24 40 01          movb   $0x1,0x40(%rsp)
  400f0b:   0f 85 99 00 00 00       jne    400faa <main+0x22a>
  400f11:   48 8d 4c 24 1f          lea    0x1f(%rsp),%rcx
  400f16:   48 8d 54 24 20          lea    0x20(%rsp),%rdx
  400f1b:   48 8d bc 24 90 00 00    lea    0x90(%rsp),%rdi
  400f22:   00 
  400f23:   48 89 de                mov    %rbx,%rsi
  400f26:   e8 25 1a 00 00          callq  402950 <_ZN5Eigen13CwiseBinaryOpINS_8internal20scalar_difference_opIdEEKNS_12SparseMatrixIdLi0EiEEKNS_19SparseSparseProductIRS6_S8_EEEC1ES8_RSA_RKS3_>
  400f2b:   48 8d bc 24 a0 00 00    lea    0xa0(%rsp),%rdi
  400f32:   00 
  400f33:   e8 18 02 00 00          callq  401150 <_ZN5Eigen12SparseMatrixIdLi0EiED1Ev>
  400f38:   e8 43 fd ff ff          callq  400c80 <clock@plt>   # timing ends

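I guess that in this case the compiler cannot determine that the result of the computation is unused, in contrast to the first two cases.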

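If you look at the documentation, you can see that simple operations such as + on sparse matrices do not return a matrix but rather a CwiseBinaryOp representing the result. I think that if you do not use this object somewhere, the resulting matrix is never constructed.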
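Since an unused result may never be computed, a benchmark is only meaningful if the result is actually consumed. The following is a rough sketch of one way to arrange that with the standard library alone. `time_seconds`, `g_sink`, and `sum_of_squares` are hypothetical helpers invented for this example, and the volatile sink is just one common trick, not something the answers above prescribe.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `work` once and returns the elapsed time in seconds,
// measured with a monotonic clock.
template <typename F>
double time_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Results are written through a volatile object, so the compiler has to
// treat the store as observable and cannot discard the computation.
volatile double g_sink = 0.0;

// Placeholder workload standing in for the operation being measured.
inline double sum_of_squares(std::size_t n) {
    std::vector<double> x(n, 2.0);
    double s = 0.0;
    for (double value : x) {
        s += value * value;
    }
    return s;
}
```

A call such as `time_seconds([] { g_sink = sum_of_squares(1000); })` then times work whose result is really stored. In the Eigen case, the analogous step would be assigning the expression to a `SparseMatrix<double>` and reading something from it, such as `nonZeros()`, inside the timed region.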

Answer 3

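I think that, as @hfhc2 mentioned, the first two statements in the code are optimized away completely by the compiler (since their results are not needed for the rest). In the third statement, an auxiliary intermediate variable is most likely created to hold the temporary result of spIdent*spIdent. To see this clearly, consider the following example, which includes explicit copy assignments: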

#include <iostream>
#include <ctime>
#include <Eigen/Sparse>

using namespace std;
using namespace Eigen;

int main () {

   const unsigned int N = 2000000;

   SparseMatrix<double> spIdent(N,N);
   SparseMatrix<double> a(N,N), b(N,N), c(N,N);

   spIdent.reserve(N);
   spIdent.setIdentity();

   clock_t start = clock();
   a = spIdent*spIdent;
   cout << "0 Computation time: " << float(clock() - start)/1e6 << endl;

   start = clock();
   b = spIdent-spIdent;
   cout << "1 Computation time: " << float(clock() - start)/1e6 << endl;

   start = clock();
   c = a - b;
   cout << "2 Computation time: " << float(clock() - start)/1e6 << endl;

   return 0;

}

The measured times (without compiler optimization) are [for openSUSE 12.2 (x86_64), g++ 4.7.1, Intel 2-core 2 GHz CPU]:

0 Computation time: 1.58737
1 Computation time: 0.417798
2 Computation time: 0.428174

This seems quite reasonable.
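A side note on the timing code: the listings divide the tick difference by 1e6, which matches CLOCKS_PER_SEC on POSIX systems but is not guaranteed elsewhere. A slightly more portable conversion could look like the sketch below; `ticks_to_seconds` is a hypothetical helper name, not something from the original code.

```cpp
#include <ctime>

// Converts a std::clock() tick difference to seconds using the
// platform's own tick rate instead of a hard-coded 1e6.
inline double ticks_to_seconds(std::clock_t ticks) {
    return static_cast<double>(ticks) / static_cast<double>(CLOCKS_PER_SEC);
}
```

It would be used as `ticks_to_seconds(clock() - start)` in place of `float(clock() - start)/1e6`. Note that clock() reports processor time rather than wall-clock time, which is usually fine for a single-threaded measurement like this one.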

3 answers

Answer 1: 4, accepted, 2016-08-03 13:08:12

Answer 2: 3, 2016-08-03 12:58:15

Answer 3: 2, 2016-08-03 13:30:01
