简体   繁体   中英

Rcpp fast statistical mode function with vector input of any type

I'm trying to build a super fast mode function for R to use for aggregating large categorical datasets. The function should take vector input of all supported R types and return the mode. I have read This post , This Help-page and others, but I was not able to make the function take in all R data types. My code now works for numeric vectors, I am relying on Rcpp sugar wrapper functions:

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
int Mode(NumericVector x, bool narm = false) 
{
    if (narm) x = x[!is_na(x)];
    NumericVector ux = unique(x);
    int y = ux[which_max(table(match(x, ux)))];
    return y;
}

In addition I was wondering if the ' narm ' argument can be renamed ' na.rm ' without giving errors, and of course if there is a faster way to code a mode function in C++, I would be grateful to know about it.

In order to make the function work for any vector input, you could implement @JosephWood's algorithm for any data type you want to support and call it from a switch(TYPEOF(x)) . But that would be lots of code duplication. Instead, it is better to make a generic function that can work on any Vector<RTYPE> argument. If we follow R's paradigm that everything is a vector and let the function also return a Vector<RTYPE> , then we can make use of RCPP_RETURN_VECTOR . Note that we need C++11 to be able to pass additional arguments to the function called by RCPP_RETURN_VECTOR . One tricky thing is that you need the storage type for Vector<RTYPE> in order to create a suitable std::unordered_map . Here Rcpp::traits::storage_type<RTYPE>::type comes to the rescue. However, std::unordered_map does not know how to deal with complex numbers from R. For simplicity, I am disabling this special case.

Putting it all together:

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>

template <int RTYPE>
Vector<RTYPE> fastModeImpl(Vector<RTYPE> x, bool narm){
  if (narm) x = x[!is_na(x)];
  int myMax = 1;
  Vector<RTYPE> myMode(1);
  // special case for factors == INTSXP with "class" and "levels" attribute
  if (x.hasAttribute("levels")){
    myMode.attr("class") = x.attr("class");
    myMode.attr("levels") = x.attr("levels");
  }
  std::unordered_map<typename Rcpp::traits::storage_type<RTYPE>::type, int> modeMap;
  modeMap.reserve(x.size());

  for (std::size_t i = 0, len = x.size(); i < len; ++i) {
    auto it = modeMap.find(x[i]);

    if (it != modeMap.end()) {
      ++(it->second);
      if (it->second > myMax) {
        myMax = it->second;
        myMode[0] = x[i];
      }
    } else {
      modeMap.insert({x[i], 1});
    }
  }

  return myMode;
}

template <>
Vector<CPLXSXP> fastModeImpl(Vector<CPLXSXP> x, bool narm) {
  stop("Not supported SEXP type!");
}

// [[Rcpp::export]]
SEXP fastMode( SEXP x, bool narm = false ){
  RCPP_RETURN_VECTOR(fastModeImpl, x, narm);
}

/*** R
set.seed(1234)
s <- sample(1e5, replace = TRUE)
fastMode(s)
fastMode(s + 0.1)
l <- sample(c(TRUE, FALSE), 11, replace = TRUE) 
fastMode(l)
c <- sample(letters, 1e5, replace = TRUE)
fastMode(c)
f <- as.factor(c)
fastMode(f) 
*/

Output:

> set.seed(1234)

> s <- sample(1e5, replace = TRUE)

> fastMode(s)
[1] 85433

> fastMode(s + 0.1)
[1] 85433.1

> l <- sample(c(TRUE, FALSE), 11, replace = TRUE) 

> fastMode(l)
[1] TRUE

> c <- sample(letters, 1e5, replace = TRUE)

> fastMode(c)
[1] "z"

> f <- as.factor(c)

> fastMode(f) 
[1] z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

As noted above, the used algorithm comes from Joseph Wood's answer , which has been explicitly dual-licensed under CC-BY-SA and GPL >= 2. I am following Joseph and hereby license the code in this answer under the GPL (version 2 or later) in addition to the implicit CC-BY-SA license.

In your Mode function, since you are mostly calling sugar wrapper functions, you won't see that much improvement over base R . In fact, simply writing a faithful base R translation, we have:

baseMode <- function(x, narm = FALSE) {
    if (narm) x <- x[!is.na(x)]
    ux <- unique(x)
    ux[which.max(table(match(x, ux)))]
}

And benchmarking, we have:

set.seed(1234)
s <- sample(1e5, replace = TRUE)

library(microbenchmark)
microbenchmark(Mode(s), baseMode(s), times = 10, unit = "relative")
Unit: relative
       expr      min       lq     mean   median       uq      max neval
    Mode(s) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10
baseMode(s) 1.490765 1.645367 1.571132 1.616061 1.637181 1.448306    10

Typically, when we undertake the effort of writing our own compiled code, we would expect bigger gains. Simply wrapping these already efficient compiled functions in Rcpp isn't going to magically get you the gains you expect. In fact, on larger examples the base solution is faster. Observe:

set.seed(1234)
sBig <- sample(1e6, replace = TRUE)

system.time(Mode(sBig))
 user  system elapsed 
1.410   0.036   1.450 

system.time(baseMode(sBig))
 user  system elapsed 
0.915   0.025   0.943 

To address your question of writing a faster mode function, we can make use of std::unordered_map , which is very similar to table underneath the hood (ie they are both hash tables at their heart). Additionally, since you are returning a single integer, we can safely assume that we can replace NumericVector with IntegerVector and also that you are not concerned with returning every value that occurs the most.

The algorithm below can be modified to return the true mode , but I will leave that as an exercise (hint: you will need std::vector along with taking some sort of action when it->second == myMax ). NB you will also need to add // [[Rcpp::plugins(cpp11)]] at the top of your cpp file for std::unordered_map and auto .

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>

// [[Rcpp::export]]
int fastIntMode(IntegerVector x, bool narm = false) {
    if (narm) x = x[!is_na(x)];
    int myMax = 1;
    int myMode = 0;
    std::unordered_map<int, int> modeMap;
    modeMap.reserve(x.size());

    for (std::size_t i = 0, len = x.size(); i < len; ++i) {
        auto it = modeMap.find(x[i]);

        if (it != modeMap.end()) {
            ++(it->second);
            if (it->second > myMax) {
                myMax = it->second;
                myMode = x[i];
            }
        } else {
            modeMap.insert({x[i], 1});
        }
    }

    return myMode;
}

And the benchmarks:

microbenchmark(Mode(s), baseMode(s), fastIntMode(s), times = 15, unit = "relative")
Unit: relative
          expr      min       lq     mean   median       uq      max neval
       Mode(s) 6.428343 6.268131 6.622914 6.134388 6.881746  7.78522    15
   baseMode(s) 9.757491 9.404101 9.454857 9.169315 9.018938 10.16640    15
fastIntMode(s) 1.000000 1.000000 1.000000 1.000000 1.000000  1.00000    15

Now we are talking... about 6x faster than the original and 9x faster than base. They all return the same value:

fastIntMode(s)
##[1] 85433

baseMode(s)
##[1] 85433

Mode(s)
##[1] 85433

And for our larger example:

## base R returned in 0.943s
system.time(fastIntMode(s))
 user  system elapsed 
0.217   0.006   0.224

In addition to the implicit CC-BY-SA license I hereby license the code in this answer under the GPL >= 2 .

To follow up with some shameless self-promotion, I have now published a package collapse on CRAN which includes a full set of Fast Statistical Functions , amonst them the generic function fmode . The implementation is based on index hashing and even faster than the solution above. fmode can be used to perform simple, grouped and/or weighted mode calculations on vectors, matrices, data.frames and dplyr grouped tibbles. Syntax:

fmode(x, g = NULL, w = NULL, ...)

where x is a vector, matrix, data.frame or grouped_df, g is a grouping vector or list of grouping vectors, and w is a vector of weights. A compact solution to categorical and mixed aggregation problems is further provided by the function collap . The code

collap(data, ~ id1 + id2, FUN = fmean, catFUN = fmode)

aggregates the mixed type data.frame data applying fmean to numeric and fmode to categorical columns. More customized calls are also possible. Together with the Fast Statistical Functions , collap is just as fast as data.table on large numeric data, and categorical and weighted aggregations are significantly faster than anything that can presently be done with data.table .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM