Short question: is it possible to use matrix-wise operations in caretSBF$score function?
Motivation: When working with big matrices in R, operations that work natively matrix-wise [eg rowMeans(X) ] are often much faster than one-row-at-a-time approaches [eg apply(X, 1, mean) ]. Here is a benchmark example, using a matrix with a million rows and 100 columns:
rows = 1000000
cols = 100
X <- matrix(rnorm(rows*cols),nrow = rows)
ptm <- proc.time()
tt <- apply(X, 1, function(x) { t.test(x[1:50],x[51:100], var.equal = FALSE)$p.value })
proc.time() - ptm
# user system elapsed
# 312.420 0.685 313.633
library(genefilter)
ptm <- proc.time()
ftt <- rowFtests(X, fac = factor(c(rep(0,50), rep(1,50))), var.equal=FALSE)
proc.time() - ptm
# user system elapsed
# 21.400 1.336 23.257
Details: In the caret package, the caretSBF functions score and filter can be used to select features for cross-validated modeling. I want to use a custom scoring function in place of caretSBF$score (this part I can do), but I want it to be matrix-wise (like above -- this part I can't do). When I first looked at the functions, I couldn't see obvious reason why this wouldn't work. I want to do something like this:
mySBF$score <- function(x, y) {
genefilter::rowFtests(x, fac = y)$p.value
}
In place of the default:
$score
function (x, y)
{
if (is.factor(y))
anovaScores(x, y)
else gamScores(x, y)
}
<environment: namespace:caret>
But I can't make it work. Are matrix-wise operations just not supported by caretSBF?
Are matrix-wise operations just not supported by caretSBF?
No, not really. The score
function is only served one predictor at a time.
However, you can get there using custom models in train
. Here is an example that conducts feature extraction prior to modeling. You can adapt this with a multivariate filter and use the subset to fit the model. Here is a really crappy example:
> library(caret)
> set.seed(1)
> training <- LPH07_1(200)
>
> crappy <- getModelInfo("lm", regex = FALSE)[[1]]
> crappy$fit <- function (x, y, wts, param, lev, last, classProbs, ...) {
+ dat <- if (is.data.frame(x)) x else as.data.frame(x)
+ ## randomly filter all but 3 predictors
+ dat <- dat[, sample(1:ncol(dat), 3)]
+ dat$.outcome <- y
+ lm(.outcome ~ ., data = dat, ...)
+ }
> crappy$predict <- function (modelFit, newdata, submodels = NULL) {
+ if (!is.data.frame(newdata))
+ newdata <- as.data.frame(newdata)
## make sure to apply the subsetting part here too
+ predict(modelFit, newdata[, predictors(modelFit$terms)])
+ }
>
>
> mod <- train(y ~ ., data = training,
+ method = crappy)
> mod
Linear Regression
200 samples
10 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
Resampling results
RMSE Rsquared RMSE SD Rsquared SD
3.08 0.077 0.258 0.0864
> predictors(mod)
[1] "Var08" "Var03" "Var04"
Max
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.