Matrix access and multiplication optimization for CPU

Question

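I am building an internally optimized matrix wrapper in Java (with the help of JNI). I need some confirmation on this: can you give some hints about matrix optimizations? What I am going to implement is:

A matrix can be represented as four sets of buffers/arrays: one for horizontal access, one for vertical access, one for diagonal access, and a command buffer that computes matrix elements only when they are needed. Here is an illustration.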

Matrix signature: 

0  1  2  3
4  5  6  7
8  9  1  3
3  5  2  9

First (horizontal) set: 
horSet[0]={0,1,2,3} horSet[1]={4,5,6,7} horSet[2]={8,9,1,3} horSet[3]={3,5,2,9}

Second(vertical) set:
verSet[0]={0,4,8,3} verSet[1]={1,5,9,5} verSet[2]={2,6,1,2} verSet[3]={3,7,3,9}

Third(optional) a diagonal set:
diagS={0,5,1,9} //just in case some calculation needs this

Fourth (calculation list, in a "one calculation, one datum" fashion) set:
calc={0,2,1,3,2,5} --->0 means multiply by the next element
                       1 means add the next element
                       2 means divide by the next element
                       so this list means
                       ( (a[i]*2)+3 ) / 5  when only a[i] is needed.
Example for fourth set: 
A.mult(2),   A.sum(3),  A.div(5), A.mult(B)
(to list)   (to list)  (to list) (calculate *+/ just in time when A is needed )
 so only one memory access for four operations.
 loop start
 a[i] = b[i] * ( ( a[i]*2) +3 ) / 5  only for A.mult(B)
 loop end
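
The deferred calculation list above could be sketched in C++ roughly as follows. This is only an illustration under assumptions: the names `Op`, `Command`, `applyCommands` and `multiplyWithPending` are made up for this example, the element type is assumed to be `double`, and the opcodes follow the 0/1/2 encoding from the list above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Opcodes follow the encoding in the question: 0 = multiply, 1 = add, 2 = divide.
enum Op { MUL = 0, ADD = 1, DIV = 2 };

// One pending operation together with its operand (hypothetical type).
struct Command {
    Op op;
    double value;
};

// Apply every pending command to a single element, in list order.
inline double applyCommands(double x, const std::vector<Command>& calc) {
    for (const Command& c : calc) {
        switch (c.op) {
            case MUL: x *= c.value; break;
            case ADD: x += c.value; break;
            case DIV: x /= c.value; break;
        }
    }
    return x;
}

// Element-wise a[i] = b[i] * f(a[i]), where f is the pending command list.
// Each a[i] is loaded once and stored once, however many commands are pending.
inline void multiplyWithPending(std::vector<double>& a,
                                const std::vector<double>& b,
                                const std::vector<Command>& calc) {
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] * applyCommands(a[i], calc);
    }
}
```

With the list `{{MUL, 2}, {ADD, 3}, {DIV, 5}}` this evaluates `b[i] * ((a[i]*2)+3)/5` in one pass. Interpreting the list per element costs a branch per command, so a real implementation would probably specialize or fuse common sequences.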

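So, as shown above, when one needs to access a column element, the second set provides contiguous access, with no strided jumps. The same holds for horizontal access through the first set.

This should make some things easier and some things harder: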
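
A minimal sketch of the first two sets, assuming `double` elements and one flat buffer per set rather than one array per row or column. The class name `DualMatrix` and its members are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dual-layout matrix: the same values are stored twice,
// once row-major (the horizontal set) and once column-major (the vertical set).
class DualMatrix {
public:
    DualMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), hor_(rows * cols, 0.0), ver_(rows * cols, 0.0) {}

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Every write has to update both copies; this is the price of the scheme.
    void set(std::size_t r, std::size_t c, double v) {
        assert(r < rows_ && c < cols_);
        hor_[r * cols_ + c] = v;
        ver_[c * rows_ + r] = v;
    }

    // Pointer to row r: cols() contiguous values.
    const double* row(std::size_t r) const {
        assert(r < rows_);
        return &hor_[r * cols_];
    }

    // Pointer to column c: rows() contiguous values.
    const double* col(std::size_t c) const {
        assert(c < cols_);
        return &ver_[c * rows_];
    }

private:
    std::size_t rows_, cols_;
    std::vector<double> hor_;
    std::vector<double> ver_;
};
```

For the 4x4 example matrix, `col(1)` then points at the contiguous run 1, 5, 9, 5, which matches `verSet[1]`.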

 Easier: 
 **Matrix transposition.
 Just swapping the pointers horSet[x] and verSet[x] is enough.

 **Matrix * Matrix multiplication.
 One matrix gives one of its horizontal set and other matrix gives vertical buffer.
 Dot product of these must be highly parallelizable for intrinsics/multithreading.
 If the multiplication order is inverse, then horizontal and verticals are switched.

 **Matrix * vector multiplication.
 Same as above, just a vector can be taken as horizontal or vertical freely.
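
The matrix * matrix case can be sketched as a set of contiguous dot products. The sketch assumes square n x n matrices of `double`, with A passed in row-major form (its horizontal set) and B in column-major form (its vertical set). The function names `dotContiguous` and `multiplyRowsByCols` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product of two contiguous runs of n values.
inline double dotContiguous(const double* x, const double* y, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

// C = A * B for n x n matrices.
// aRows: A stored row-major (the horizontal set of A).
// bCols: B stored column-major (the vertical set of B).
// The result is returned row-major.
inline std::vector<double> multiplyRowsByCols(const std::vector<double>& aRows,
                                              const std::vector<double>& bCols,
                                              std::size_t n) {
    assert(aRows.size() == n * n && bCols.size() == n * n);
    std::vector<double> c(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // Row i of A and column j of B are both contiguous in memory.
            c[i * n + j] = dotContiguous(&aRows[i * n], &bCols[j * n], n);
        }
    }
    return c;
}
```

Each `(i, j)` entry is independent of the others, so the two outer loops are the natural place to split work across threads, and `dotContiguous` is the part one would replace with intrinsics.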

 Harder:
 ** Doubling memory requirement is bad for many cases.
 ** Initializing a matrix takes longer.
 ** When a matrix is multiplied from the left, its sets need a vertical-->horizontal
 update if it is going to be multiplied from the right afterwards (and vice versa).
 (If a transposition is taken in between, this does not apply.)


 Neutral:
 ** Same matrix can be multiplied with two other matrices to get two different
 results such as A=A*B(saved in horizontal sets)   A=C*A(saved in vertical sets)
 then A=A*A gives   A*B*C*A(in horizontal) and C*A*A*B (in vertical) without
 copying A. 

 ** If a matrix is always multiplied from the left, or always from the right, every
 access and multiplication needs no update and stays contiguous in RAM.

 ** Using only the horizontal sets before transposing and only the vertical sets
 after should not break any rules.
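
The transposition trick from the "Easier" list can be sketched as a constant-time swap, assuming each set is held in one flat buffer. `TwoSetMatrix` and `transposeInPlace` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical matrix holding both layouts of the same data.
struct TwoSetMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> hor;  // row-major copy
    std::vector<double> ver;  // column-major copy
};

// The row-major storage of A^T is exactly the column-major storage of A,
// so transposing only swaps the two buffers and the two dimensions.
inline void transposeInPlace(TwoSetMatrix& m) {
    assert(m.hor.size() == m.rows * m.cols);
    assert(m.ver.size() == m.rows * m.cols);
    std::swap(m.hor, m.ver);   // swaps internal pointers, copies no elements
    std::swap(m.rows, m.cols);
}
```

This only works while both sets are up to date. If one of them is stale after a multiplication, it has to be refreshed first.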

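The main purpose is to have matrices of size (multiple of 8, multiple of 8) and to apply AVX intrinsics with multiple threads (each thread works on one set concurrently).

So far I have only implemented the vector * vector dot product. I will go further into this if you programming masters can give me some direction.

The dot product I wrote (with intrinsics) is 6x faster than the loop-unrolled version (which is itself twice as fast as the plain multiplication loop). It also gets stuck at the memory bandwidth cap when multithreading is enabled in the wrapper (8x --> uses nearly 20 GB/s, which is close to my DDR3's limit). I have already tried OpenCL; it is somewhat slow on the CPU but great on the GPU.

Thank you.

Edit: How would a "block matrix" buffer perform? When multiplying large matrices, small patches are multiplied in a special way, and the cache can probably be used to reduce main memory accesses. But this would need even more updates between matrix multiplications, across the vertical, horizontal and diagonal sets and this block set.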
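
For reference, an AVX dot product over lengths that are a multiple of 8 might look like the sketch below. It is not the code from the question: the element type `float`, the name `dotAvx` and the use of unaligned loads are assumptions, and a scalar fallback is included so the snippet still builds when AVX is not enabled at compile time.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Dot product of two float arrays whose length n is a multiple of 8.
inline float dotAvx(const float* a, const float* b, std::size_t n) {
    assert(n % 8 == 0);
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();              // 8 partial sums
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);        // unaligned load of 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);                  // spill the 8 partial sums
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k) sum += lanes[k];
    return sum;
#else
    // Portable fallback used when the compiler is not targeting AVX.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
#endif
}
```

With GCC or Clang the AVX path is compiled only when a flag such as `-mavx` is given. Buffers handed over from Java through JNI would also need suitable alignment before the aligned load variants could be used.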
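
To make the block idea concrete, here is a minimal sketch of a tiled multiplication. It assumes square row-major matrices of `double`. The function name `multiplyBlocked` and the tile size parameter `bs` are hypothetical, and a good value for `bs` depends on the cache sizes of the machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B, all n x n and row-major, processed in bs x bs tiles so that
// the tiles being combined have a chance to stay in cache.
inline std::vector<double> multiplyBlocked(const std::vector<double>& a,
                                           const std::vector<double>& b,
                                           std::size_t n, std::size_t bs) {
    assert(bs > 0 && a.size() == n * n && b.size() == n * n);
    std::vector<double> c(n * n, 0.0);
    for (std::size_t ii = 0; ii < n; ii += bs) {
        for (std::size_t kk = 0; kk < n; kk += bs) {
            for (std::size_t jj = 0; jj < n; jj += bs) {
                // Clamp tile ends so n need not be a multiple of bs.
                const std::size_t iEnd = std::min(ii + bs, n);
                const std::size_t kEnd = std::min(kk + bs, n);
                const std::size_t jEnd = std::min(jj + bs, n);
                for (std::size_t i = ii; i < iEnd; ++i) {
                    for (std::size_t k = kk; k < kEnd; ++k) {
                        const double aik = a[i * n + k];
                        // The inner loop walks a row of B and a row of C contiguously.
                        for (std::size_t j = jj; j < jEnd; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

With the i-k-j order inside a tile, B is read along its rows, so this variant needs no separate vertical copy of B at all.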

Answer 1

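This is effectively equivalent to caching the transpose. It sounds like you intend to do this eagerly. I would compute the transpose only when it is needed and remember it in case it is needed again. That way, if you never need it, it never gets computed.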
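
A minimal sketch of that lazy approach, assuming `double` elements in one row-major buffer. The class name `LazyTransposeMatrix` and the `rebuildCount` counter are made up for the illustration; the counter exists only to show how often the transpose is actually computed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical matrix that builds its transposed copy on demand.
class LazyTransposeMatrix {
public:
    LazyTransposeMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0),
          cacheValid_(false), rebuilds_(0) {}

    void set(std::size_t r, std::size_t c, double v) {
        assert(r < rows_ && c < cols_);
        data_[r * cols_ + c] = v;
        cacheValid_ = false;  // any write makes the cached transpose stale
    }

    // Column-major copy, built on first use and reused until the next write.
    const std::vector<double>& transposed() {
        if (!cacheValid_) {
            cache_.resize(rows_ * cols_);
            for (std::size_t r = 0; r < rows_; ++r)
                for (std::size_t c = 0; c < cols_; ++c)
                    cache_[c * rows_ + r] = data_[r * cols_ + c];
            cacheValid_ = true;
            ++rebuilds_;
        }
        return cache_;
    }

    std::size_t rebuildCount() const { return rebuilds_; }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;   // row-major values
    std::vector<double> cache_;  // transposed copy, valid only if cacheValid_
    bool cacheValid_;
    std::size_t rebuilds_;
};
```

Calling `transposed()` twice in a row rebuilds the copy only once. A matrix that is never used column-wise never pays for the second buffer's contents.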

Answer 2

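Several libraries use expression templates to apply very specific, optimized functions to a cascade of matrix operations.

The C++ Programming Language also has a short section on "Fused Operations" (29.5.4, 4th edition).

This enables chaining of statements like: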

M = A*B.transp(); // where M, A, B are matrices

In which case you need 3 classes:

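class Matrix;

// Lightweight view that marks a matrix as "to be used transposed".
class Transposed
{
public:
  explicit Transposed(const Matrix &matrix) : m_matrix(matrix) {}
  const Matrix & obj (void) const { return m_matrix; }
private:
  const Matrix & m_matrix;
};

// Deferred A * B^T: holds references only and computes nothing yet.
class MatrixMatrixMulTransPosed
{
public:
  MatrixMatrixMulTransPosed(const Matrix &matrix, const Transposed &trans) 
    : m_matrix(matrix), m_transposed(trans.obj()) {}
  const Matrix & matrix (void) const { return m_matrix; }
  const Matrix & transposed (void) const { return m_transposed; }
private:
  const Matrix & m_matrix;
  const Matrix & m_transposed;
};

class Matrix
{
  public:
    // Returns a view of this matrix; no data is moved.
    Transposed transp (void) const { return Transposed(*this); }

    // Const references let the temporaries in A*B.transp() bind here.
    MatrixMatrixMulTransPosed operator* (const Transposed &rhs) const
    { 
      return MatrixMatrixMulTransPosed(*this, rhs); 
    }

    Matrix& operator= (const MatrixMatrixMulTransPosed &mmtrans)
    {
      // Actual computation goes here and is stored in this.
      // using mmtrans.matrix() and mmtrans.transposed()
      return *this;
    }
};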

You can extend this concept to provide a specialized function for every computation that is critical by any means.
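
To show where the fused computation would go, here is a self-contained sketch of the same idea with the assignment actually evaluating `A * B^T`. The names `Mat`, `TransposedView` and `MulTransposedExpr` are hypothetical, and the storage is assumed to be a row-major buffer of `double`. It illustrates the pattern and is not the code from the book or from any particular library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class Mat;

// Proxy objects: they only hold references and do no work themselves.
struct TransposedView { const Mat& m; };
struct MulTransposedExpr { const Mat& lhs; const Mat& rhsT; };

class Mat {
public:
    Mat(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), d_(rows * cols, 0.0) {}

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    double& at(std::size_t r, std::size_t c) { return d_[r * cols_ + c]; }
    double at(std::size_t r, std::size_t c) const { return d_[r * cols_ + c]; }

    TransposedView transp() const { return TransposedView{*this}; }

    MulTransposedExpr operator*(const TransposedView& t) const {
        return MulTransposedExpr{*this, t.m};
    }

    // Fused evaluation: M(i,j) = sum_k A(i,k) * B(j,k).
    // Both operands are read along their rows; B^T is never materialized.
    Mat& operator=(const MulTransposedExpr& e) {
        const Mat& a = e.lhs;
        const Mat& b = e.rhsT;
        assert(a.cols_ == b.cols_);
        assert(this != &a && this != &b);  // aliasing is not handled in this sketch
        rows_ = a.rows_;
        cols_ = b.rows_;
        d_.assign(rows_ * cols_, 0.0);
        for (std::size_t i = 0; i < rows_; ++i) {
            for (std::size_t j = 0; j < cols_; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < a.cols_; ++k)
                    sum += a.d_[i * a.cols_ + k] * b.d_[j * b.cols_ + k];
                d_[i * cols_ + j] = sum;
            }
        }
        return *this;
    }

private:
    std::size_t rows_, cols_;
    std::vector<double> d_;
};
```

A statement such as `M = A * B.transp();` then runs the single fused loop above, with no temporary matrix for the transpose or for the product.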

Answer 1
1 2013-07-18 00:21:34

Answer 2
1 Accepted 2013-07-18 02:34:08
