
Why isn't column-wise operation much faster than row-wise operation (as it should be) for a matrix in R

Consider the following functions, which fill a matrix row-wise and column-wise, respectively.

#include <Rcpp.h>
using namespace Rcpp;

const int m = 10000;
const int n = 3;

// [[Rcpp::export]]
SEXP rowWise() {
  SEXP A = Rf_allocMatrix(INTSXP, m, n);
  int* p = INTEGER(A);
  int i, j;
  for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
      p[m * j + i] = j;
    }
  }
  return A;
}

// [[Rcpp::export]]
SEXP columnWise() {
  SEXP A = Rf_allocMatrix(INTSXP, n, m);
  int* p = INTEGER(A);
  int i, j;
  for (j = 0; j < m; j++) {
    for (i = 0; i < n; i++) {
      p[n * j + i] = i;
    }
  }
  return A;
}


/*** R
library(microbenchmark)
gc()
microbenchmark(
  rowWise(),
  columnWise(),
  times = 1000
)
*/

The above code yields

Unit: microseconds
         expr    min     lq     mean  median      uq       max neval
    rowWise() 12.524 18.631 64.24991 20.4540 24.8385 10894.353  1000
 columnWise() 11.803 19.434 40.08047 20.9005 24.1585  8590.663  1000

Assigning values row-wise is as fast as (if not faster than) assigning them column-wise, which is counter-intuitive to what I believe.

However, the result does seem to depend, almost magically, on the values of m and n. So I guess my question is: why isn't columnWise much faster than rowWise?

The dimension (shape) of the matrix has an impact.


When we do a row-wise scan of a 10000 x 3 integer matrix A, we can still cache effectively. For simplicity of illustration, I assume that each column of A is aligned to a cache line boundary.

--------------------------------------
A[1, 1] A[1, 2] A[1, 3]        M  M  M
A[2, 1] A[2, 2] A[2, 3]        H  H  H
   .        .       .          .  .  .
   .        .       .          .  .  .
A[16,1] A[16,2] A[16,3]        H  H  H
--------------------------------------
A[17,1] A[17,2] A[17,3]        M  M  M
A[18,1] A[18,2] A[18,3]        H  H  H
   .        .       .          .  .  .
   .        .       .          .  .  .
A[32,1] A[32,2] A[32,3]        H  H  H
--------------------------------------
A[33,1] A[33,2] A[33,3]        M  M  M
A[34,1] A[34,2] A[34,3]        H  H  H
   .        .       .          .  .  .
   .        .       .          .  .  .

A 64-byte cache line can hold 16 (4-byte) integers. When we access A[1, 1], a full cache line is filled, that is, A[1, 1] to A[16, 1] are all loaded into cache. When we scan the row A[1, 1], A[1, 2], A[1, 3], a 16 x 3 submatrix is loaded into cache, which is much smaller than the cache capacity (32 KB). While we have a cache miss (M) for each element in the 1st row, by the time we start to scan the 2nd row, we have a cache hit (H) for every element. So we have a periodic pattern like this:

[3 Misses] -> [45 Hits] -> [3 Misses] -> [45 Hits] -> ...

That is, we have on average a cache miss ratio of 3 / 48 = 1 / 16 = 6.25%. In fact, this equals the cache miss ratio when we scan A column-wise, where we have the following periodic pattern:

[1 Miss] -> [15 Hits] -> [1 Miss] -> [15 Hits] -> ...

Now try a 5000 x 5000 matrix. In that case, after reading the first row, 16 x 5000 elements have been fetched into cache, which is much larger than the cache capacity, so eviction has already kicked out A[1, 1] to A[16, 1] (most caches use a "least recently used" (LRU)-style line replacement policy). When we come back to scan the 2nd row, we have to fetch A[2, 1] from RAM again. So a row-wise scan gives a cache miss ratio of 100%. In contrast, a column-wise scan only has a cache miss ratio of 1 / 16 = 6.25%. In this example, we will observe that the column-wise scan is much faster.


In summary, with a 10000 x 3 matrix, we have the same cache performance whether we scan it by row or by column. I don't see that rowWise is faster than columnWise from the median times reported by microbenchmark. Their execution times may not be exactly equal, but the difference is too small to be of concern.

For a 5000 x 5000 matrix, rowWise is much slower than columnWise .

Thanks for verifying this.


Remark

The "golden rule" that we should ensure sequential memory access in the innermost loop is a general guideline for efficiency, but don't interpret it too narrowly.

In fact, if you treat the three columns of A as three vectors x, y, z and consider the element-wise addition (i.e., the row-wise sum of A): z[i] = x[i] + y[i], don't we have sequential access for all three vectors? Doesn't this fall under the "golden rule"? Scanning a 10000 x 3 matrix by row is no different from reading three vectors sequentially in alternation, and that is very efficient.
