Efficiency of matrix rowSums() vs. colSums() in R vs Rcpp vs Armadillo

Question

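Background

Coming from R programming, I am expanding into compiled code in the form of C/C++ with Rcpp. As a hands-on exercise on the effect of loop interchange (and on C/C++ in general), I implemented equivalents of R's rowSums() and colSums() functions for matrices with Rcpp. (I know these exist as Rcpp sugar and in Armadillo; this was just an exercise.)

Question

I have my C++ implementations of rowSums() and colSums(), along with Rcpp sugar and arma::sum() versions, in this matsums.cpp file. Mine are just simple loops like this: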

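#include <Rcpp.h>
using namespace Rcpp;

// Column sums: one scalar accumulator per column, written back once.
// [[Rcpp::export]]
NumericVector Cpp_colSums(const NumericMatrix& x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericVector ans(nc);
  for (int j = 0; j < nc; j++) {
    double sum = 0.0;
    for (int i = 0; i < nr; i++) {
      sum += x(i, j);
    }
    ans[j] = sum;
  }
  return ans;
}

// Row sums: columns stay in the outer loop; ans is updated element-wise.
// [[Rcpp::export]]
NumericVector Cpp_rowSums(const NumericMatrix& x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericVector ans(nr);  // zero-initialized by the constructor
  for (int j = 0; j < nc; j++) {
    for (int i = 0; i < nr; i++) {
      ans[i] += x(i, j);
    }
  }
  return ans;
}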

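(R matrices are stored in column-major order, so keeping the columns in the outer loop should be the more efficient approach. This is what I set out to test in the first place.)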
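To make the column-major layout concrete, here is a minimal plain-C sketch. It is not the contents of matsums.cpp; the names cm_get, col_sums_cm and row_sums_cm are made up for illustration. It assumes the matrix is a flat array in which element (i, j) sits at offset i + j * nr, which is how R lays out a numeric matrix.

```c
#include <stddef.h>

/* Element (i, j) of an nr-row column-major matrix lives at x[i + j * nr]. */
static double cm_get(const double *x, size_t nr, size_t i, size_t j) {
  return x[i + j * nr];
}

/* Column sums: the inner loop walks down one column (contiguous memory)
   and accumulates into a single scalar. */
void col_sums_cm(const double *x, size_t nr, size_t nc, double *ans) {
  for (size_t j = 0; j < nc; j++) {
    double sum = 0.0;
    for (size_t i = 0; i < nr; i++) sum += cm_get(x, nr, i, j);
    ans[j] = sum;
  }
}

/* Row sums with the same loop order: x is still read contiguously,
   but every element of ans is updated once per column. */
void row_sums_cm(const double *x, size_t nr, size_t nc, double *ans) {
  for (size_t i = 0; i < nr; i++) ans[i] = 0.0;
  for (size_t j = 0; j < nc; j++)
    for (size_t i = 0; i < nr; i++) ans[i] += cm_get(x, nr, i, j);
}
```

Both functions read the matrix in storage order; they differ only in where the running sums are kept.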

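While running benchmarks on these, I ran into something I did not expect: there is a clear performance difference between row sums and column sums (see the benchmarks below):

With the built-in R functions, colSums() is about twice as fast as rowSums().
With my own Rcpp versions and the sugar versions, this is reversed: rowSums() is almost twice as fast as colSums().
Finally, with the Armadillo implementations, the two operations are roughly equal (column sums perhaps a bit faster, as I would have expected in R, too).

So my primary question is: why is Cpp_rowSums() significantly faster than Cpp_colSums()?

As a secondary interest, I am also curious why the same difference is reversed in the R implementations. (I skimmed through the C source but could not make out the significant differences.) (And third, how does Armadillo get nearly equal performance?)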

Benchmarks

I tested all 4 implementations of both functions on a 10,000 x 10,000 symmetric matrix:

Rcpp::sourceCpp("matsums.cpp")

set.seed(92136)
n <- 1e4 # build n x n test matrix
x <- matrix(rnorm(n), 1, n)
x <- crossprod(x, x) # symmetric

bench::mark(
  rowSums(x),
  colSums(x),
  Cpp_rowSums(x),
  Cpp_colSums(x),
  Sugar_rowSums(x),
  Sugar_colSums(x),
  Arma_rowSums(x),
  Arma_colSums(x)
)[, 1:7]

#> # A tibble: 8 x 7
#>   expression            min     mean   median      max `itr/sec` mem_alloc
#>   <chr>            <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 rowSums(x)        192.2ms  207.9ms  194.6ms  236.9ms      4.81    78.2KB
#> 2 colSums(x)         93.4ms   97.2ms   96.5ms  101.3ms     10.3     78.2KB
#> 3 Cpp_rowSums(x)     73.5ms   76.3ms     76ms   80.4ms     13.1     80.7KB
#> 4 Cpp_colSums(x)    126.5ms  127.6ms  126.8ms  130.3ms      7.84    80.7KB
#> 5 Sugar_rowSums(x)   73.9ms   75.6ms   74.3ms   79.4ms     13.2     80.7KB
#> 6 Sugar_colSums(x)  124.2ms  125.8ms  125.6ms  127.9ms      7.95    80.7KB
#> 7 Arma_rowSums(x)    73.2ms   74.7ms   73.9ms   79.3ms     13.4     80.7KB
#> 8 Arma_colSums(x)    62.8ms   64.4ms   63.7ms   69.6ms     15.5     80.7KB

(Again, you can find the C++ source file matsums.cpp here.)

Platform:

> sessioninfo::platform_info()
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Windows >= 8 x64            
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 tz       Europe/Helsinki             
 date     2018-08-09

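Update

Investigating further, I also wrote the same functions using R's traditional C interface: the source is here. I compiled the functions with R CMD SHLIB and tested again: row sums were still faster than column sums (benchmarks). I then also looked at the disassembly with objdump, but it seems to me (with my very limited understanding of assembly) that the compiler does not really optimize the main loop bodies (rows, cols) of the C code any further, either?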

Answer 1

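First, let me show the timing statistics on my laptop. I use a 5000 x 5000 matrix, which is sufficient for benchmarking, and the microbenchmark package with 100 evaluations.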

Unit: milliseconds
             expr       min        lq      mean    median        uq       max
       colSums(x)  71.40671  71.64510  71.80394  71.72543  71.80773  75.07696
   Cpp_colSums(x)  71.29413  71.42409  71.65525  71.48933  71.56241  77.53056
 Sugar_colSums(x)  73.05281  73.19658  73.38979  73.25619  73.31406  76.93369
  Arma_colSums(x)  39.08791  39.34789  39.57979  39.43080  39.60657  41.70158
       rowSums(x) 177.33477 187.37805 187.57976 187.49469 187.73155 194.32120
   Cpp_rowSums(x)  54.00498  54.37984  54.70358  54.49165  54.73224  64.16104
 Sugar_rowSums(x)  54.17001  54.38420  54.73654  54.56275  54.75695  61.80466
  Arma_rowSums(x)  49.54407  49.77677  50.13739  49.90375  50.06791  58.29755

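The C code in R core is not always better than what we can write ourselves: Cpp_rowSums is much faster than rowSums. I do not feel competent to explain why R's version is slower than it should be. I will focus on how we can further optimize our own colSums and rowSums to beat Armadillo. Note that I write C, use R's old C interface, and compile with R CMD SHLIB.

Is there any substantial difference between `colSums` and `rowSums`?

If we have an n x n matrix that is much larger than the capacity of the CPU cache, colSums loads n x n data from RAM into cache, but rowSums loads twice as much, i.e. 2 x n x n.

Think about the result vector that holds the sums: how many times is this length-n vector loaded from RAM into cache? For colSums it is loaded only once, but for rowSums it is loaded n times. Each time a matrix column is added to it, it is loaded into cache but then evicted because it is too big.

For a large n:

colSums causes n x n + n data loads from RAM to cache;
rowSums causes n x n + n x n data loads from RAM to cache.

In other words, rowSums is in theory less memory efficient and is likely to be slower.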
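The two counts above can be written as a small C sketch for plugging in numbers. The function names traffic_colsums and traffic_rowsums are hypothetical, and the model is only the simplified one described above (the accumulator vector is fully evicted between columns); it is not a measurement.

```c
/* Doubles moved from RAM to cache by colSums on an n x n matrix,
   assuming the result vector is touched once: n*n + n. */
double traffic_colsums(double n) { return n * n + n; }

/* Same for the naive rowSums, assuming the length-n accumulator is
   evicted and reloaded for each of the n columns: n*n + n*n. */
double traffic_rowsums(double n) { return n * n + n * n; }
```

For n = 5000 this gives 25,005,000 versus 50,000,000 doubles, i.e. roughly 200 MB versus 400 MB at 8 bytes per double.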

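How to improve the performance of `colSums`?

Since the data flow between RAM and cache is already optimal, the only remaining improvement is loop unrolling. Unrolling the inner loop (the summation loop) to a depth of 2 is sufficient, and we will see about a 2x boost.

Loop unrolling works because it lets the CPU's instruction pipeline do its job. If we do only one addition per iteration, no pipelining is possible; with two additions, instruction-level parallelism starts to work. We could also unroll the loop to a depth of 4, but in my experience depth-2 unrolling is enough to get most of the benefit.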
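As a minimal sketch of depth-2 unrolling on a single vector (the name sum_unrolled2 is made up for illustration, and this is a simplified stand-alone version rather than the code given later in this answer):

```c
#include <stddef.h>

/* Sum x[0..n-1] with two independent accumulators. */
double sum_unrolled2(const double *x, size_t n) {
  double a0 = 0.0, a1 = 0.0;
  size_t i = n % 2;              /* 1 if n is odd, else 0 */
  if (i) a0 = x[0];              /* peel the leftover element first */
  for (; i < n; i += 2) {        /* main loop: two additions per iteration */
    a0 += x[i];
    a1 += x[i + 1];
  }
  return a0 + a1;                /* combine the two partial sums */
}
```

Because the additions are grouped differently from a plain left-to-right loop, the result can differ from it in the last few bits for general floating-point input.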

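How to improve the performance of `rowSums`?

Optimizing the data flow is the first step. We first need to do cache blocking to reduce the data transfer from 2 x n x n down to n x n.

Chop the n x n matrix into a number of row chunks, each 2040 x n (the last chunk may be smaller), then apply the ordinary rowSums chunk by chunk. For each chunk, the accumulator vector has length 2040, about half of what a 32KB CPU cache can hold. The other half is reserved for the matrix column being added to this accumulator vector. In this way the accumulator vector can stay in cache until all matrix columns in the chunk have been processed. As a result, the accumulator vector is loaded into cache only once, so the overall memory performance is as good as that of colSums.

Now we can further apply loop unrolling to the rowSums within each chunk. Unrolling both the outer loop and the inner loop to a depth of 2, we will see a boost. Once the outer loop is unrolled, the chunk size should be reduced to 1360, because we now need space in the cache for three length-1360 vectors per outer-loop iteration.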
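The row-chunking idea on its own, without any unrolling, can be sketched as follows. The function name row_sums_blocked and the treatment of block == 0 are assumptions made for this illustration; the matrix is assumed to be column-major with nr rows.

```c
#include <stddef.h>

/* Row sums of an nr x nc column-major matrix A, processed in chunks of
   at most `block` rows so that the active part of `sum` stays in cache. */
void row_sums_blocked(const double *A, size_t nr, size_t nc,
                      size_t block, double *sum) {
  if (block == 0) block = nr;            /* guard: 0 means "no blocking" */
  for (size_t i0 = 0; i0 < nr; i0 += block) {
    size_t len = nr - i0;                /* rows left from i0 onwards */
    if (len > block) len = block;
    double *acc = sum + i0;              /* accumulator for this chunk */
    for (size_t i = 0; i < len; i++) acc[i] = 0.0;
    for (size_t j = 0; j < nc; j++) {
      const double *col = A + i0 + j * nr;   /* chunk rows of column j */
      for (size_t i = 0; i < len; i++) acc[i] += col[i];
    }
  }
}
```

With block = 2040 the accumulator takes 2040 * 8 = 16320 bytes, which is where the "half of a 32KB cache" figure above comes from.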

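C code: let's beat Armadillo

Writing code with loop unrolling can be a nasty job, because we now need to write several different versions of each function.

For colSums we need two versions:

colSums_1x1: both the inner and the outer loop are unrolled with depth 1, i.e. this is the version without loop unrolling;
colSums_2x1: no outer-loop unrolling, while the inner loop is unrolled with depth 2.

For rowSums we can have up to four versions, rowSums_sxt, where s = 1 or 2 is the unrolling depth of the inner loop and t = 1 or 2 is the unrolling depth of the outer loop.

Writing the code can be very tedious if we write each version one by one. After many years of frustration with this, I developed an "automatic code / version generation" trick using inlined template functions and macros.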

#include <stdlib.h>
#include <Rinternals.h>

static inline void colSums_template_sx1 (size_t s,
                                         double *A, size_t LDA,
                                         size_t nr, size_t nc,
                                         double *sum) {

  size_t nrc = nr % s, i;
  double *A_end = A + LDA * nc, a0, a1;

  for (; A < A_end; A += LDA) {
    a0 = 0.0; a1 = 0.0;  // accumulator register variables
    if (nrc > 0) a0 = A[0];  // is there a "fractional loop"?
    for (i = nrc; i < nr; i += s) {  // main loop of depth-s
      a0 += A[i];  // 1st iteration
      if (s > 1) a1 += A[i + 1];  // 2nd iteration
      }
    if (s > 1) a0 += a1;  // combine two accumulators
    *sum++ = a0;  // write-back
    }

  }

#define macro_define_colSums(s, colSums_sx1) \
SEXP colSums_sx1 (SEXP matA) { \
  double *A = REAL(matA); \
  size_t nrow_A = (size_t)nrows(matA); \
  size_t ncol_A = (size_t)ncols(matA); \
  SEXP result = PROTECT(allocVector(REALSXP, ncols(matA))); \
  double *sum = REAL(result); \
  colSums_template_sx1(s, A, nrow_A, nrow_A, ncol_A, sum); \
  UNPROTECT(1); \
  return result; \
  }

macro_define_colSums(1, colSums_1x1)
macro_define_colSums(2, colSums_2x1)

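The template function computes (in R syntax) sum <- colSums(A[1:nr, 1:nc]) for a matrix A with LDA (leading dimension of A) rows. The parameter s selects the inner-loop unrolling version. The template function looks horrible at first glance because it contains many if statements. However, it is declared static inline. If it is called with a known constant 1 or 2 passed in for s, an optimizing compiler is able to evaluate those if statements at compile time, eliminate unreachable code, and drop "set-but-not-used" variables (register variables that are initialized and modified but never written back to RAM).

The macro is used for function definition. Given a constant s and a function name, it generates a function with the desired loop unrolling version.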
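The same pattern, stripped of the R interface, looks like this on a toy operation (a sum of squares). All names here (sumsq_template, DEFINE_SUMSQ, sumsq_1, sumsq_2) are invented for the illustration. Whether the branches really disappear depends on the compiler and optimization level, so treat that as an expectation to check rather than a guarantee.

```c
#include <stddef.h>

/* Generic body: s is an ordinary parameter, but every wrapper below passes
   a literal, so after inlining the `if (s > 1)` tests can be constant-folded. */
static inline double sumsq_template(size_t s, const double *x, size_t n) {
  double a0 = 0.0, a1 = 0.0;
  size_t i = n % s;                    /* leftover count (0 or 1 for s <= 2) */
  if (i > 0) a0 = x[0] * x[0];         /* peel the leftover element */
  for (; i < n; i += s) {
    a0 += x[i] * x[i];
    if (s > 1) a1 += x[i + 1] * x[i + 1];
  }
  if (s > 1) a0 += a1;
  return a0;
}

/* Generate one externally visible function per unrolling depth. */
#define DEFINE_SUMSQ(s, name) \
  double name(const double *x, size_t n) { \
    return sumsq_template(s, x, n); \
  }

DEFINE_SUMSQ(1, sumsq_1)
DEFINE_SUMSQ(2, sumsq_2)
```

Only s = 1 and s = 2 are supported by this body, matching the depths used in this answer.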

The following is for rowSums.

static inline void rowSums_template_sxt (size_t s, size_t t,
                                         double *A, size_t LDA,
                                         size_t nr, size_t nc,
                                         double *sum) {

  size_t ncr = nc % t, nrr = nr % s, i;
  double *A_end = A + LDA * nc, *B;
  double a0, a1;

  for (i = 0; i < nr; i++) sum[i] = 0.0;  // necessary initialization

  if (ncr > 0) {  // is there a "fractional loop" for the outer loop?
    if (nrr > 0) sum[0] += A[0];  // is there a "fractional loop" for the inner loop?
    for (i = nrr; i < nr; i += s) {  // main inner loop with depth-s
      sum[i] += A[i];
      if (s > 1) sum[i + 1] += A[i + 1];
      }
    A += LDA;
    }

  for (; A < A_end; A += t * LDA) {  // main outer loop with depth-t
    if (t > 1) B = A + LDA;
    if (nrr > 0) {  // is there a "fractional loop" for the inner loop?
      a0 = A[0]; if (t > 1) a0 += A[LDA];
      sum[0] += a0;
      }
    for(i = nrr; i < nr; i += s) {  // main inner loop with depth-s
      a0 = A[i]; if (t > 1) a0 += B[i];
      sum[i] += a0;
      if (s > 1) {
        a1 = A[i + 1]; if (t > 1) a1 += B[i + 1];
        sum[i + 1] += a1;
        }
      }
    }

  }

#define macro_define_rowSums(s, t, rowSums_sxt) \
SEXP rowSums_sxt (SEXP matA, SEXP chunk_size) { \
  double *A = REAL(matA); \
  size_t nrow_A = (size_t)nrows(matA); \
  size_t ncol_A = (size_t)ncols(matA); \
  SEXP result = PROTECT(allocVector(REALSXP, nrows(matA))); \
  double *sum = REAL(result); \
  size_t block_size = (size_t)asInteger(chunk_size); \
  size_t i, block_size_i; \
  if (block_size > nrow_A) block_size = nrow_A; \
  for (i = 0; i < nrow_A; i += block_size_i) { \
    block_size_i = nrow_A - i; if (block_size_i > block_size) block_size_i = block_size; \
    rowSums_template_sxt(s, t, A, nrow_A, block_size_i, ncol_A, sum); \
    A += block_size_i; sum += block_size_i; \
    } \
  UNPROTECT(1); \
  return result; \
  }

macro_define_rowSums(1, 1, rowSums_1x1)
macro_define_rowSums(1, 2, rowSums_1x2)
macro_define_rowSums(2, 1, rowSums_2x1)
macro_define_rowSums(2, 2, rowSums_2x2)

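Note that the template function now accepts both s and t, and that the function defined by the macro applies row chunking.

Even though I have left some comments in the code, it is probably still not easy to follow, but I cannot spend more time explaining it in greater detail.

To use them, copy and paste them into a C file called "matSums.c" and compile it with R CMD SHLIB -c matSums.c.

For the R side, define the following functions in "matSums.R".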

colSums_zheyuan <- function (A, s) {
  dyn.load("matSums.so")
  if (s == 1) result <- .Call("colSums_1x1", A)
  if (s == 2) result <- .Call("colSums_2x1", A)
  dyn.unload("matSums.so")
  result
  }

rowSums_zheyuan <- function (A, chunk.size, s, t) {
  dyn.load("matSums.so")
  if (s == 1 && t == 1) result <- .Call("rowSums_1x1", A, as.integer(chunk.size))
  if (s == 2 && t == 1) result <- .Call("rowSums_2x1", A, as.integer(chunk.size))
  if (s == 1 && t == 2) result <- .Call("rowSums_1x2", A, as.integer(chunk.size))
  if (s == 2 && t == 2) result <- .Call("rowSums_2x2", A, as.integer(chunk.size))
  dyn.unload("matSums.so")
  result
  }

Now let's run a benchmark, again with a 5000 x 5000 matrix.

A <- matrix(0, 5000, 5000)

library(microbenchmark)
source("matSums.R")

microbenchmark("col0" = colSums(A),
               "col1" = colSums_zheyuan(A, 1),
               "col2" = colSums_zheyuan(A, 2),
               "row0" = rowSums(A),
               "row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

On my laptop I get:

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
 col0  65.33908  71.67229  71.87273  71.80829  71.89444 111.84177   100
 col1  67.16655  71.84840  72.01871  71.94065  72.05975  77.84291   100
 col2  35.05374  38.98260  39.33618  39.09121  39.17615  53.52847   100
 row0 159.48096 187.44225 185.53748 187.53091 187.67592 202.84827   100
 row1  49.65853  54.78769  54.78313  54.92278  55.08600  60.27789   100
 row2  49.42403  54.56469  55.00518  54.74746  55.06866  60.31065   100
 row3  37.43314  41.57365  41.58784  41.68814  41.81774  47.12690   100
 row4  34.73295  37.20092  38.51019  37.30809  37.44097  99.28327   100

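Note how loop unrolling speeds up both colSums and rowSums. With full optimization ("col2" and "row4"), we beat Armadillo (see the timing table at the beginning of this answer).

The row chunking strategy does not clearly yield a benefit in this case. Let's try a matrix with millions of rows.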

A <- matrix(0, 1e+7, 20)
microbenchmark("row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

I get:

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
 row1 604.7202 607.0256 617.1687 607.8580 609.1728 720.1790   100
 row2 514.7488 515.9874 528.9795 516.5193 521.4870 636.0051   100
 row3 412.1884 413.8688 421.0790 414.8640 419.0537 525.7852   100
 row4 377.7918 379.1052 390.4230 379.9344 386.4379 476.9614   100

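In this case we do observe the gains from cache blocking.

Final thoughts

Basically, this answer has addressed all the issues except the following:

Why R's rowSums is less efficient than it should be.
Why, without any optimization, rowSums ("row1") is faster than colSums ("col1").

Again, I cannot explain the first one, and I do not really care about it, since we can easily write a version that is faster than R's built-in one.

The second one is definitely worth pursuing. I copy my comments from our discussion room here for the record.

This issue comes down to: "Why is summing up a single vector slower than adding two vectors element-wise?"

I see similar phenomena from time to time. The first time I encountered this strange behavior was a few years ago, when I coded my own matrix-matrix multiplication. I found that DAXPY was faster than DDOT.

DAXPY does this: y += a * x, where x and y are vectors and a is a scalar; DDOT does this: a += x * y, an inner product accumulated into the scalar a.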
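For reference, the two operations written as plain loops. The names daxpy_ref and ddot_ref are hypothetical, and these are simplified unit-stride versions for illustration, not the BLAS routines themselves (which also take stride arguments).

```c
#include <stddef.h>

/* DAXPY-style update: y[i] += a * x[i]. Each iteration is independent
   of the others. */
void daxpy_ref(size_t n, double a, const double *x, double *y) {
  for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* DDOT-style reduction: every iteration adds into the same scalar,
   so each addition depends on the previous one. */
double ddot_ref(size_t n, const double *x, const double *y) {
  double a = 0.0;
  for (size_t i = 0; i < n; i++) a += x[i] * y[i];
  return a;
}
```

In the terms of this question, Cpp_rowSums has the DAXPY-like access pattern (with a = 1) and Cpp_colSums has the DDOT-like one.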

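Given that DDOT is a reduction operation, I expected it to be faster than DAXPY. But no, DAXPY was faster.

However, as soon as I unrolled the loops in the triple loop nest of the matrix-matrix multiplication, DDOT became much faster than DAXPY.

A very similar thing happens in your case. A reduction operation, a = x[1] + x[2] + ... + x[n], is slower than an element-wise addition, y[i] += x[i]. But once loop unrolling is done, the advantage of the latter is lost.

I am not sure whether the following explanation is correct, as I have no evidence.

The reduction operation has a dependency chain, so the computation is strictly serial; the element-wise operation, on the other hand, has no dependency chain, so the CPU can do better with it.

As soon as we unroll the loop, each iteration has more arithmetic to do and the CPU can pipeline in both cases. The true advantage of the reduction operation can then be observed.

In reply to 夏侯 on using `rowSums2` and `colSums2` from `matrixStats`

Still using the 5000 x 5000 example above.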

A <- matrix(0, 5000, 5000)

library(microbenchmark)
source("matSums.R")
library(matrixStats)  ## NEW

microbenchmark("col0" = base::colSums(A),
               "col*" = matrixStats::colSums2(A),  ## NEW
               "col1" = colSums_zheyuan(A, 1),
               "col2" = colSums_zheyuan(A, 2),
               "row0" = base::rowSums(A),
               "row*" = matrixStats::rowSums2(A),  ## NEW
               "row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
 col0  71.53841  71.72628  72.13527  71.81793  71.90575  78.39645   100
 col*  75.60527  75.87255  76.30752  75.98990  76.18090  87.07599   100
 col1  71.67098  71.86180  72.06846  71.93872  72.03739  77.87816   100
 col2  38.88565  39.03980  39.57232  39.08045  39.16790  51.39561   100
 row0 187.44744 187.58121 188.98930 187.67168 187.86314 206.37662   100
 row* 158.08639 158.26528 159.01561 158.34864 158.62187 174.05457   100
 row1  54.62389  54.81724  54.97211  54.92394  55.04690  56.33462   100
 row2  54.15409  54.44208  54.78769  54.59162  54.76073  60.92176   100
 row3  41.43393  41.63886  42.57511  41.73538  41.81844 111.94846   100
 row4  37.07175  37.25258  37.45033  37.34456  37.47387  43.14157   100

I do not see a performance advantage for rowSums2 and colSums2.

Answer 2

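"Why is Cpp_rowSums() significantly faster than Cpp_colSums()?" When fetching "row major", the CPU prefetcher can predict what you are doing and fetch the next set of data from main memory into the CPU cache before you need it. This can speed up your access to the data.

When you access "column major", the prefetcher has a harder job predicting what you need next, so it will not be stuffing things into the cache ahead of time as efficiently (if at all), and that slows you down.

CPUs like accessing data linearly. If you do not do what they like, you pay the price of cache misses and main memory access latencies.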
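A small sketch of what "linear" versus non-linear access means for a flat matrix buffer. The names total_contiguous and total_strided are made up for illustration, and the buffer is assumed to be column-major with nr rows, as in R. Note that both functions in the question keep the column index in the outer loop, so this illustrates the general effect rather than those two functions specifically.

```c
#include <stddef.h>

/* Visits elements in storage order: consecutive addresses, stride 1. */
double total_contiguous(const double *x, size_t nr, size_t nc) {
  double s = 0.0;
  for (size_t j = 0; j < nc; j++)
    for (size_t i = 0; i < nr; i++) s += x[i + j * nr];
  return s;
}

/* Visits elements along rows: each step jumps nr doubles ahead. */
double total_strided(const double *x, size_t nr, size_t nc) {
  double s = 0.0;
  for (size_t i = 0; i < nr; i++)
    for (size_t j = 0; j < nc; j++) s += x[i + j * nr];
  return s;
}
```

Both return the same total for exactly representable inputs; timing them on a large matrix is one way to see how much the access order matters on a given machine.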

Answer 1: 11 votes, accepted, 2018-08-09 22:12:04

Answer 2: 1 vote, 2018-08-09 19:48:12
