I need to count the number of rows of a logical matrix in which every entry is TRUE. Because I need to be able to do this 1 to 2500 million times on a relatively regular basis, speed really matters. The most efficient / fastest single-process way I've figured out how to do this is the "how many" Rcpp function (`hm2`) below.
My limited profiling abilities show me that the vast majority of the time is spent in the `if(r_ttl == xcols){...}` check. I can't think of a different algorithm that would be faster (I have tried breaking out of the loop as soon as a FALSE is found, and it is much slower).
The input can be assumed to look like this:
m <- matrix(sample(c(T,F),50000*10, replace = T),ncol = 10L)
head(m)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
#> [2,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
#> [3,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
#> [4,] TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
#> [5,] TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
#> [6,] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
// [[Rcpp::export]]
int hm(const LogicalMatrix& x){
const int xrows = x.nrow();
const int xcols = x.ncol();
int n_all_true = 0;
for(size_t row = 0; row < xrows; row++) {
int r_ttl = 0;
for(size_t col = 0; col < xcols; col++) {
r_ttl += x(row,col);
}
if(r_ttl == xcols){
n_all_true++;
}
}
return n_all_true;
}
I don't understand why, but on my machine it is faster if I hard-code the number of columns (an explanation of why this is would be great too):
// [[Rcpp::export]]
int hm2(const LogicalMatrix& x){
const int xrows = x.nrow();
// const int xcols = x.ncol();
int n_all_true = 0;
for(size_t row = 0; row < xrows; row++) {
int r_ttl = 0;
for(size_t col = 0; col < 10; col++) {
r_ttl += x(row,col);
}
if(r_ttl == 10){
n_all_true += 1;
}
}
return n_all_true;
}
microbenchmark(hm(m), hm2(m), times = 1000)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> hm(m) 597.828 599.0995 683.3482 605.397 643.8655 1659.711 1000
#> hm2(m) 236.847 237.6565 267.8787 238.748 253.5280 683.221 1000
You can go about 30% faster still with OpenMP (which I now see goes against the question's request for single-threaded solutions) and minimal code changes, at least on my 4-core Xeon. I have a feeling that a logical AND reduction may do better, but I will leave that for another day:
library(Rcpp)
library(microbenchmark)
m_rows <- 10L   # note: used below as the number of columns (ncol = m_rows)
m_cols <- 50000L
rebuild <- FALSE
cppFunction('int hm(const LogicalMatrix& x)
{
const int xrows = x.nrow();
const int xcols = x.ncol();
int n_all_true = 0;
for(size_t row = 0; row < xrows; row++) {
int r_ttl = 0;
for(size_t col = 0; col < xcols; col++) {
r_ttl += x(row,col);
}
if(r_ttl == xcols){
n_all_true++;
}
}
return n_all_true;
}', rebuild = rebuild)
hm3 <- function(m) {
nc <- ncol(m)
sum(rowSums(m) == nc)
}
cppFunction('int hm_jmu(const LogicalMatrix& x)
{
const int xrows = x.nrow();
const int xcols = x.ncol();
int n_all_true = 0;
for(int row = 0; row < xrows; row++) {
int r_ttl = 0;
for(int col = 0; col < xcols; col++) {
r_ttl += x(row,col);
}
if(r_ttl == xcols){
n_all_true++;
}
}
return n_all_true;
}', rebuild = rebuild)
macroExpand <- function(NCOL) {
paste0('int hm_npjc(const LogicalMatrix& x)
{
const int xrows = x.nrow();
int n_all_true = 0;
for(int row = 0; row < xrows; row++) {
int r_ttl = 0;
for(int col = 0; col < ',NCOL,'; col++) {
r_ttl += x(row,col);
}
if(r_ttl == ',NCOL,'){
n_all_true++;
}
}
return n_all_true;
}')
}
macroExpand_omp <- function(NCOL) {
paste0('int hm_npjc_omp(const LogicalMatrix& x)
{
const int xrows = x.nrow();
int n_all_true = 0;
#pragma omp parallel for reduction(+:n_all_true)
for(int row = 0; row < xrows; row++) {
int r_ttl = 0;
for(int col = 0; col < ',NCOL,'; col++) {
r_ttl += x(row,col);
}
if(r_ttl == ',NCOL,'){
n_all_true++;
}
}
return n_all_true;
}')
}
cppFunction(macroExpand(m_rows), rebuild = rebuild)
cppFunction(macroExpand_omp(m_rows), plugins = "openmp", rebuild = rebuild)
cppFunction('int hm_omp(const LogicalMatrix& x) {
const int xrows = x.nrow();
const int xcols = x.ncol();
int n_all_true = 0;
#pragma omp parallel for reduction(+:n_all_true) schedule(static)
for(size_t row = 0; row < xrows; row++) {
int r_ttl = 0;
for(size_t col = 0; col < xcols; col++) {
r_ttl += x(row,col);
}
if(r_ttl == xcols){
n_all_true++;
}
}
return n_all_true;
}', plugins = "openmp", rebuild = rebuild)
# Things that made no difference: using != as the inner loop control,
# pre-incrementing n_all_true, static vs dynamic OpenMP scheduling.
# Attempts to direct clang and gcc to unroll loops didn't seem to work.
set.seed(21)
m <- matrix(sample(c(TRUE, FALSE), m_cols * m_rows, replace = T), ncol = m_rows)
print(microbenchmark(hm(m), hm3(m), hm_jmu(m), hm_npjc(m),
hm_omp(m), hm_npjc_omp(m),
times = 1000))
I used GCC 4.9; Clang 3.7 gave similar results. Giving:

#> Unit: microseconds
#>            expr      min        lq       mean   median        uq       max neval
#>           hm(m)  614.074  640.9840  643.24836  641.462  642.9920   976.694  1000
#>          hm3(m) 2705.066 2768.3080 2948.39388 2775.992 2786.8625 43424.060  1000
#>       hm_jmu(m)  591.179  612.3590  625.84484  612.881  613.8825  6874.428  1000
#>      hm_npjc(m)   62.958   63.8965   64.89338   64.346   65.0550   144.487  1000
#>       hm_omp(m)   91.892   92.6050  165.21507   93.758   98.8230 10026.583  1000
#>  hm_npjc_omp(m)   43.129   43.6820  129.15842   44.458   47.0860 17636.875  1000
The OpenMP magic is just the inclusion of `-fopenmp` at compile and link time (taken care of by Rcpp via `plugins = "openmp"`), plus `#pragma omp parallel for reduction(+:n_all_true) schedule(static)`. Here the outer loop is parallelized, and since the result is a sum, the reduction clause tells the compiler to break the problem down across threads and combine each part's sum into a single total. `schedule(static)` describes how the compiler and/or runtime allocates loop iterations between threads. In this case the width of both the inner and outer loops is known, so `static` is preferred; if, say, the inner loop size varied a lot, then `dynamic` might balance the work better between threads.
It is possible to explicitly tell OpenMP how many loop iterations you would like per thread, but it is often better to let the compiler decide.
On a different note, I tried hard to use compiler flags such as `-funroll-loops` to replace the ugly-but-fast hard-coding of the inner loop width (which is not a generalized solution to the question), but to no avail: see https://github.com/jackwasey/optimization-comparison
Here's your function, and the output from compiling it via `cppFunction`:
require(Rcpp)
cppFunction('int hm(const LogicalMatrix& x)
{
const int xrows = x.nrow();
const int xcols = x.ncol();
int n_all_true = 0;
for(size_t row = 0; row < xrows; row++) {
int r_ttl = 0;
for(size_t col = 0; col < xcols; col++) {
r_ttl += x(row,col);
}
if(r_ttl == xcols){
n_all_true++;
}
}
return n_all_true;
}')
# file.*.cpp: In function ‘int hm(const LogicalMatrix&)’:
# file.*.cpp:12:29: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
# for(size_t row = 0; row < xrows; row++) {
# ^
# file.*.cpp:14:31: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
# for(size_t col = 0; col < xcols; col++) {
# ^
Note the warnings. I can get a bit of an improvement by using `int` instead of `size_t` for both `row` and `col`. Other than that, I can't find much room for improvement.
And here's my code, benchmarks, and reproducible example:
require(Rcpp)
require(microbenchmark)
cppFunction('int hm_jmu(const LogicalMatrix& x)
{
const int xrows = x.nrow();
const int xcols = x.ncol();
int n_all_true = 0;
for(int row = 0; row < xrows; row++) {
int r_ttl = 0;
for(int col = 0; col < xcols; col++) {
r_ttl += x(row,col);
}
if(r_ttl == xcols){
n_all_true++;
}
}
return n_all_true;
}')
hm3 <- function(m) {
nc <- ncol(m)
sum(rowSums(m) == nc)
}
set.seed(21)
m <- matrix(sample(c(T,F),50000*10, replace = T),ncol = 10L)
microbenchmark(hm(m), hm3(m), hm_jmu(m), times=1000)
# Unit: microseconds
# expr min lq median uq max neval
# hm(m) 578.844 594.1460 607.357 636.4410 858.347 1000
# hm3(m) 6389.014 6452.9595 6476.197 6735.5465 33720.870 1000
# hm_jmu(m) 409.920 415.0395 424.401 449.0075 650.127 1000
I was very curious as to why "baking in" a value that was already declared `const` would make any difference, so I played around with the idea.
library(Rcpp)
library(microbenchmark)
cppFunction('int hm(const LogicalMatrix& x)
{
const int xrows = x.nrow();
const int xcols = x.ncol();
int n_all_true = 0;
for(size_t row = 0; row < xrows; row++) {
int r_ttl = 0;
for(size_t col = 0; col < xcols; col++) {
r_ttl += x(row,col);
}
if(r_ttl == 10){
n_all_true++;
}
}
return n_all_true;
}')
hm3 <- function(m) {
nc <- ncol(m)
sum(rowSums(m) == nc)
}
cppFunction('int hm_jmu(const LogicalMatrix& x)
{
const int xrows = x.nrow();
const int xcols = x.ncol();
int n_all_true = 0;
for(int row = 0; row < xrows; row++) {
int r_ttl = 0;
for(int col = 0; col < xcols; col++) {
r_ttl += x(row,col);
}
if(r_ttl == xcols){
n_all_true++;
}
}
return n_all_true;
}')
I'm just taking Joshua's solution here, but generating the tailored function by code generation works well on my machine. This seems hacky to me, but I thought I would post it anyway:
macroExpand <- function(NCOL) {
paste0('int hm_npjc(const LogicalMatrix& x)
{
const int xrows = x.nrow();
int n_all_true = 0;
for(int row = 0; row < xrows; row++) {
int r_ttl = 0;
for(int col = 0; col < ',NCOL,'; col++) {
r_ttl += x(row,col);
}
if(r_ttl == ',NCOL,'){
n_all_true++;
}
}
return n_all_true;
}')
}
cppFunction(macroExpand(10L))
set.seed(21)
m <- matrix(sample(c(T,F),50000*10, replace = T),ncol = 10L)
microbenchmark(hm(m), hm3(m), hm_jmu(m), hm_npjc(m), times=1000)
#> Unit: microseconds
#>        expr      min        lq      mean    median        uq       max neval
#>       hm(m)  596.808  600.1870  722.5140  629.1750  709.3875  1680.379  1000
#>      hm3(m) 2189.164 2353.6700 2972.1463 2509.4630 2956.7675 49930.471  1000
#>   hm_jmu(m)  574.137  576.5160  678.6475  600.4775  665.2800  2240.988  1000
#>  hm_npjc(m)   81.978   83.1855  102.7646   89.2160  101.0400   380.884  1000
I would like to note that I don't really understand why the compiler doesn't optimize the general version down to the same code; if anyone has insight on this, that would be awesome.
devtools::session_info()
#> Session info --------------------------------------------------------------
#> setting value
#> version R version 3.2.2 (2015-08-14)
#> system x86_64, darwin13.4.0
#> ui RStudio (0.99.691)
#> language (EN)
#> collate en_CA.UTF-8
#> tz America/Los_Angeles
#> date 2015-09-27
#> Packages ------------------------------------------------------------------
#> package * version date source
#> clipr 0.1.1 2015-09-04 CRAN (R 3.2.0)
#> colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
#> devtools 1.9.1 2015-09-11 CRAN (R 3.2.0)
#> digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
#> evaluate 0.8 2015-09-18 CRAN (R 3.2.0)
#> formatR 1.2.1 2015-09-18 CRAN (R 3.2.0)
#> ggplot2 1.0.1 2015-03-17 CRAN (R 3.2.0)
#> gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
#> htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
#> knitr 1.10.5 2015-05-06 CRAN (R 3.2.0)
#> magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
#> MASS 7.3-43 2015-07-16 CRAN (R 3.2.2)
#> memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
#> microbenchmark * 1.4-2 2014-09-28 CRAN (R 3.2.0)
#> munsell 0.4.2 2013-07-11 CRAN (R 3.2.0)
#> plyr 1.8.3 2015-06-12 CRAN (R 3.2.0)
#> proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
#> Rcpp * 0.12.1 2015-09-10 CRAN (R 3.2.0)
#> reprex 0.0.0.9001 2015-09-26 Github (jennybc/reprex@1d6584a)
#> reshape2 1.4.1 2014-12-06 CRAN (R 3.2.0)
#> rmarkdown 0.7 2015-06-13 CRAN (R 3.2.0)
#> rstudioapi 0.3.1 2015-04-07 CRAN (R 3.2.0)
#> scales 0.3.0 2015-08-25 CRAN (R 3.2.0)
#> stringi 0.5-5 2015-06-29 CRAN (R 3.2.0)
#> stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
How about exploiting the fact that `TRUE` is coerced to `1` by many numeric operators, so the whole computation is vectorized in functions that are already programmed in C. E.g.
set.seed(100)
m <- matrix(sample(c(TRUE, FALSE), 50000*10, replace = TRUE), ncol = 10L)
sum(rowSums(m) == ncol(m))
## [1] 47
microbenchmark::microbenchmark(sum(rowSums(m) == ncol(m)))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum(rowSums(m) == ncol(m)) 1.715399 1.840763 1.873422 1.861552 1.905841 2.02524 100
See R Inferno Chapter 3.
Edited answer with a direct comparison (here I pasted the two C++ functions into a file called `test.cpp` on my Desktop, with the usual Rcpp header info):
require(Rcpp)
sourceCpp("~/Desktop/test.cpp")
set.seed(100)
m <- matrix(sample(c(TRUE, FALSE), 50000*10, replace = TRUE), ncol = 10L)
hm3 <- function(m) {
nc <- ncol(m)
sum(rowSums(m) == nc)
}
microbenchmark::microbenchmark(hm(m), hm2(m), hm3(m), times = 1000)
## Unit: milliseconds
## expr min lq mean median uq max neval
## hm(m) 4.996005 5.036732 5.169672 5.089707 5.194580 9.961581 1000
## hm2(m) 5.031222 5.074990 5.228239 5.128106 5.242909 10.109776 1000
## hm3(m) 1.626933 1.878014 2.205195 1.922608 2.014012 226.894190 1000
I note here that the reference to The R Inferno is not really appropriate, since it does not apply to C++, but it's still a mantra to live by. :-)
Stumbled across this while working on a similar problem. We can squeeze out a little more performance by initializing `r_ttl` from the first column and by eliminating the `if(r_ttl == xcols)` check:
# initialize on the first column
cppFunction('int hm2(const LogicalMatrix& x)
{
const int xrows = x.nrow();
const int xcols = x.ncol();
int n_all_true = 0;
for(int row = 0; row < xrows; row++) {
int r_ttl = x(row,0);
for(int col = 1; col < xcols; col++) {
r_ttl += x(row,col);
}
if(r_ttl == xcols){
n_all_true++;
}
}
return n_all_true;
}')
# use *= to eliminate the if statement
cppFunction('int hm3(const LogicalMatrix& x)
{
const int xrows = x.nrow();
const int xcols = x.ncol();
int n_all_true = 0;
for(int row = 0; row < xrows; row++) {
int r_ttl = 1;
for(int col = 0; col < xcols; col++) {
r_ttl *= x(row,col);
}
n_all_true += r_ttl;
}
return n_all_true;
}')
# both modifications
cppFunction('int hm4(const LogicalMatrix& x) {
const int xrows = x.nrow();
const int xcols = x.ncol();
int n_all_true = 0;
for(int row = 0; row < xrows; row++) {
int r_ttl = x(row,0);
for(int col = 1; col < xcols; col++) {
r_ttl *= x(row,col);
}
n_all_true += r_ttl;
}
return n_all_true;
}')
m <- matrix(sample(c(T,F),50000*10, replace = T),ncol = 10L)
microbenchmark::microbenchmark(hm_jmu = hm_jmu(m),
hm2 = hm2(m),
hm3 = hm3(m),
hm4 = hm4(m),
check = "equal",
times = 1e4)
# Unit: microseconds
# expr min lq mean median uq max neval
# hm_jmu 198.3 200.8 218.5362 208.3 212.7 5855.9 10000
# hm2 169.3 170.9 184.8722 171.4 180.0 5775.4 10000
# hm3 192.7 196.1 209.7465 196.8 206.3 1056.2 10000
# hm4 161.9 163.1 176.2370 163.5 171.9 1119.2 10000
That's about a 20% improvement over Joshua Ulrich's already substantial improvement.