
Caret: customizing feature selection using matrix-wise operations

Short question: is it possible to use matrix-wise operations in caretSBF$score function?

Motivation: When working with big matrices in R, operations that work natively matrix-wise [e.g. rowMeans(X)] are often much faster than one-row-at-a-time approaches [e.g. apply(X, 1, mean)]. Here is a benchmark using a matrix with a million rows and 100 columns:

rows <- 1000000
cols <- 100
X <- matrix(rnorm(rows * cols), nrow = rows)

ptm <- proc.time()
tt <- apply(X, 1, function(x) { t.test(x[1:50],x[51:100], var.equal = FALSE)$p.value })
proc.time() - ptm
#    user  system elapsed
# 312.420   0.685 313.633

library(genefilter)
ptm <- proc.time()
ftt <- rowFtests(X, fac = factor(c(rep(0,50), rep(1,50))), var.equal=FALSE)
proc.time() - ptm
#    user  system elapsed
#  21.400   1.336  23.257
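The same speedup can be had in base R, without genefilter, by writing the Welch t-test itself matrix-wise with rowMeans()/rowSums(). This is a sketch (the name row_t_welch is made up for this example); it agrees with the per-row t.test() call:

```r
## Welch t-test p-values for every row at once, using only vectorized
## base-R operations (no per-row loop).
row_t_welch <- function(X, g1, g2) {
  n1 <- length(g1); n2 <- length(g2)
  m1 <- rowMeans(X[, g1]); m2 <- rowMeans(X[, g2])
  ## sample variances per row (recycling m1/m2 down the columns)
  v1 <- rowSums((X[, g1] - m1)^2) / (n1 - 1)
  v2 <- rowSums((X[, g2] - m2)^2) / (n2 - 1)
  se2   <- v1 / n1 + v2 / n2
  tstat <- (m1 - m2) / sqrt(se2)
  ## Welch-Satterthwaite degrees of freedom
  df <- se2^2 / ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))
  2 * pt(-abs(tstat), df)
}

## sanity check against the slow per-row version on a small matrix
set.seed(2)
Xs <- matrix(rnorm(10 * 100), nrow = 10)
p_fast <- row_t_welch(Xs, 1:50, 51:100)
p_slow <- apply(Xs, 1, function(x)
  t.test(x[1:50], x[51:100], var.equal = FALSE)$p.value)
```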

Details: In the caret package, the caretSBF functions score and filter can be used to select features for cross-validated modeling. I want to use a custom scoring function in place of caretSBF$score (this part I can do), but I want it to operate matrix-wise, as above (this part I can't do). When I first looked at the functions, I couldn't see an obvious reason why this wouldn't work. I want to do something like this:

mySBF$score <- function(x, y) {
  genefilter::rowFtests(x, fac = y)$p.value
}

In place of the default:

$score
function (x, y) 
{
    if (is.factor(y)) 
        anovaScores(x, y)
    else gamScores(x, y)
}
<environment: namespace:caret>

But I can't make it work. Are matrix-wise operations just not supported by caretSBF?

Are matrix-wise operations just not supported by caretSBF?

No, not really. The score function is handed one predictor (column) at a time, never the whole matrix.
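In other words, sbf effectively loops over columns and calls score() on each numeric vector separately, which is why a rowFtests-style function that expects a matrix fails. A simplified sketch of that calling pattern (an illustration, not caret's actual source):

```r
## score() as caretSBF expects it: one numeric predictor x, the outcome y
score_one <- function(x, y) t.test(x ~ y)$p.value

set.seed(3)
x <- matrix(rnorm(40 * 5), nrow = 40)
y <- factor(rep(c("a", "b"), each = 20))

## roughly what the filter loop does: apply score() column by column
scores <- apply(x, 2, score_one, y = y)
```

So a matrix-wise score() never receives the matrix it needs; the vectorization has to happen somewhere else, e.g. inside a custom model's fit function.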

However, you can get there using custom models in train. Below is an example that does feature selection inside the model-fitting function; you can adapt it with a multivariate filter and use the resulting subset to fit the model. Here is a really crappy example:

> library(caret)
> set.seed(1)
> training <- LPH07_1(200)
> 
> crappy <- getModelInfo("lm", regex = FALSE)[[1]]
> crappy$fit <- function (x, y, wts, param, lev, last, classProbs, ...)  {
+   dat <- if (is.data.frame(x)) x else as.data.frame(x)
+   ## randomly filter all but 3 predictors
+   dat <- dat[, sample(1:ncol(dat), 3)]
+   dat$.outcome <- y
+   lm(.outcome ~ ., data = dat, ...)
+ }
> crappy$predict <-  function (modelFit, newdata, submodels = NULL) {
+   if (!is.data.frame(newdata)) 
+     newdata <- as.data.frame(newdata)
+   ## make sure to apply the subsetting part here too
+   predict(modelFit, newdata[, predictors(modelFit$terms)])
+ }
> 
> 
> mod <- train(y ~ ., data = training, 
+              method = crappy)
> mod
Linear Regression 

200 samples
 10 predictor

No pre-processing
Resampling: Bootstrapped (25 reps) 

Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 

Resampling results

  RMSE  Rsquared  RMSE SD  Rsquared SD
  3.08  0.077     0.258    0.0864     


> predictors(mod)
[1] "Var08" "Var03" "Var04"
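To make the example above a real multivariate filter, the random sample() line in fit can be replaced by a matrix-wise score. For the numeric outcome here, one dependency-free option is a single cor() call that scores every predictor at once (filter_fit is a made-up name; with genefilter and a factor outcome you would score the columns with colFtests instead):

```r
## Variant of the custom fit function: keep the `keep` predictors most
## correlated with the outcome, scored matrix-wise in one cor() call.
filter_fit <- function(x, y, keep = 3) {
  dat <- if (is.data.frame(x)) x else as.data.frame(x)
  r   <- abs(cor(as.matrix(dat), y))     # one call scores every column
  dat <- dat[, order(r, decreasing = TRUE)[1:keep], drop = FALSE]
  dat$.outcome <- y
  lm(.outcome ~ ., data = dat)
}

## standalone check on simulated data with one strong predictor
set.seed(4)
x <- matrix(rnorm(100 * 10), nrow = 100,
            dimnames = list(NULL, paste0("Var", 1:10)))
y <- 2 * x[, 1] + rnorm(100)
fit <- filter_fit(x, y)
```

Dropping this body into crappy$fit (keeping the wts/param/... arguments and the `...` passed to lm) gives train a resampled estimate that honestly accounts for the filtering step.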

Max
