
MM Estimation in Robust Regression

I am working with different linear regression models in R. I am using the DATASET, which has 21263 rows and 82 columns.

All of the regression models run in an acceptable amount of time except the MM-estimate regression using the R function lmrob.

I waited more than 10 hours for the first for loop (#Block A) to run, and it did not finish. By "did not finish", I mean it might only give me an output after two days. I tried the same code with a smaller DATASET, which has 9568 rows and 5 columns, and it runs in about one minute.

I am using a standard laptop.

The steps of my analysis are as follows:

Load and scale the dataset, then use a k-fold split with k = 30, because I want to calculate the variance of the coefficients for each variable across the k splits.

Could you please provide me with any guidance?

library(caret)      # createFolds()
library(robustbase) # lmrob()
library(magrittr)   # %>% pipe used below

wdbc = read.csv("train.csv") # critical_temp is the dependent variable
wdbcc = as.data.frame(scale(wdbc)) # scaling the variables
### k-folds split ###
set.seed(12345)
k = 30
folds <- createFolds(wdbcc$critical_temp, k = k, list = TRUE, returnTrain = TRUE)

############ Start of MM Regression Model #################
#Block A: fit an MM-estimate regression on each of the k training folds
lmrob = list()
for (i in 1:k) {
    lmrob[[i]] = lmrob(critical_temp ~ .,
                       data = wdbcc[folds[[i]],], setting = "KS2014")
}

#Block B: for each coefficient, collect its value from every fold and
#         compute its variance across the k folds
lmrob_coef = list()
lmrob_coef_var = list()

for (j in 1:(lmrob[[1]]$coefficients %>% length())) {

    for (i in 1:k) {
        lmrob_coef[[i]] = lmrob[[i]]$coefficients[j]
    }
    # variance of the j-th coefficient across folds (computed once per j)
    lmrob_coef_var[[j]] = lmrob_coef %>% unlist() %>% var()

}

#Block C: table of coefficient names and their across-fold variances
lmrob_var = unlist(lmrob_coef_var)
lmrob_df = cbind(coefficients = lmrob[[1]]$coefficients %>% names() %>% as.data.frame(),
                 variance = lmrob_var %>% as.data.frame())
colnames(lmrob_df) = c("coefficients", "variance_lmrob")
#Block D: total variance, summed over all coefficients
lmrob_var_sum = sum(lmrob_var)
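
For reference, a minimal sketch of a more compact way to compute the same per-coefficient variances (a possible alternative to Blocks B, C, and D, assuming the fitted models are stored in the list lmrob as in Block A):

coef_mat <- t(sapply(lmrob, coef))      # k x p matrix: one row of coefficients per fold
coef_var <- apply(coef_mat, 2, var)     # variance of each coefficient across the k folds
coef_var_sum <- sum(coef_var)           # total variance, summed over all coefficients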

Not an answer, but some code to help you test this for yourself. I didn't run lmrob() on the full dataset, but everything I show below suggests that one full realization of the model (all observations, all predictors) should run in about 10-20 minutes [on a 10-year-old MacOS desktop machine], which would extrapolate to approximately 5 hours for 30-fold cross-validation. (It looks like the time scales a little worse than the square root of the number of observations, and nonlinearly even on the log scale with the number of predictors ...) You can try the code below to see whether things are much slower on your machine, and to predict how long the whole problem should take. Other general suggestions:

  • Is there a chance you're running out of memory? Memory constraints can make things run much slower.
  • If the problem is just that things are too slow, you can easily parallelize across folds if you have access to multiple cores (see the sketch after this list); probably don't do this on a laptop, though, or you'll burn it up.
  • AWS and other cloud services can be very useful
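
Here is a minimal sketch of parallelizing the fold loop with the base parallel package (an illustration under assumptions: it reuses wdbcc and folds from the question, and mclapply() relies on forking, so it will not parallelize on Windows):

library(parallel)
library(robustbase)
ncores <- max(1, detectCores() - 1)   # leave one core free
fits <- mclapply(folds, function(idx) {
    lmrob(critical_temp ~ ., data = wdbcc[idx, ], setting = "KS2014")
}, mc.cores = ncores)                 # list of fitted models, one per fold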

I set up a test function to record the time taken by lmrob() running on a random subset of predictors and observations from your data set.

Extract data, load packages:

unzip("superconduct.zip")
xx <- read.csv("train.csv")
library(robustbase)
library(ggplot2); theme_set(theme_bw())
library(cowplot)

Define a test function for timing lmrob runs for different numbers of observations and predictors:

nc <- ncol(xx)  ## response variable is the last column, "critical_temp"
test <- function(nobs=1000,npred=10,seed=NULL, ...) {
    if (!is.null(seed)) set.seed(seed)
    dd <- xx[sample(nrow(xx),size=nobs),
             c(sample(nc-1,size=npred),nc)]
    tt <- system.time(fit <- lmrob(critical_temp ~ ., data=dd, ...))
    tt[c("user.self","sys.self","elapsed")]
}    

t0 <- test()

The minimal example here (1000 observations, 10 predictors) is very fast (0.2 seconds). This is the basic loop I ran:

res <- expand.grid(nobs=seq(1000,10000,by=1000), npred=seq(10,30,by=2))
res$elapsed <- res$sys.self <- res$user.self <- NA  ## columns in the same order test() returns them
for (i in seq(nrow(res))) {
    cat(res$nobs[i],res$npred[i],"\n")
    res[i,-(1:2)] <- test(res$nobs[i],res$npred[i],seed=101)
}

(As you can see in the plot below, I did this again with larger numbers of observations and predictors and used rbind() to combine all of the results into a single data frame, resc.) I also tried fitting linear models to predict the time needed to fit the full data set with all predictors. (Plotting [see below] suggests that the time is log-log-linear in the number of observations but nonlinear in the number of predictors ...)

m1 <- lm(log10(elapsed)~poly(log10(npred),2)*log10(nobs), data=resc)
pp <- predict(m1, newdata=data.frame(npred=ncol(xx)-1,nobs=nrow(xx)),
              interval="confidence")  
10^pp  ## convert from log10(predicted seconds) to seconds
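
For completeness, a sketch of how the combined data frame resc might be assembled (the exact extra grid I ran isn't shown, so the values below are only illustrative):

res_big <- expand.grid(nobs = c(15000, 20000), npred = seq(40, 60, by = 10))
res_big$elapsed <- res_big$sys.self <- res_big$user.self <- NA
for (i in seq(nrow(res_big))) {
    res_big[i, -(1:2)] <- test(res_big$nobs[i], res_big$npred[i], seed = 101)
}
resc <- rbind(res, res_big)   ## combined timing results used in the model above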

Test the full data set.

t_all <- test(nobs=nrow(xx),npred=ncol(xx)-1)

I then realized that you were using setting = "KS2014" (as suggested in the documentation) rather than the default: this is at least 5x slower, as the following comparison shows:

test(nobs=10000,npred=30)
test(nobs=10000,npred=30,setting = "KS2014")

I re-ran some of the timings above with setting = "KS2014". Making the prediction for the full data set suggested a run time of about 700 seconds (CI from 300 to 2000 seconds), still nowhere near as slow as you're suggesting.
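
Something like the following would reproduce that prediction (a sketch, assuming the default and KS2014 timings are stacked into a data frame resc2 with a setting column, as used in the plotting code below):

m2 <- lm(log10(elapsed) ~ poly(log10(npred), 2)*log10(nobs) + setting, data = resc2)
pp2 <- predict(m2,
               newdata = data.frame(npred = ncol(xx) - 1, nobs = nrow(xx),
                                    setting = "KS2014"),
               interval = "confidence")
10^pp2   ## predicted seconds (with CI) for the full fit with setting = "KS2014"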

gg0 <- ggplot(resc2,aes(x=npred,y=elapsed,colour=nobs,linetype=setting))+
    geom_point()+geom_line(aes(group=interaction(nobs,setting)))+
    scale_x_log10()+scale_y_log10()
gg1 <- ggplot(resc2,aes(x=nobs,y=elapsed,colour=npred, linetype=setting))+
    geom_point()+geom_line(aes(group=interaction(npred,setting)))+
    scale_x_log10()+scale_y_log10()
plot_grid(gg0,gg1,nrow=1)
ggsave("lmrob_times.pdf")

[Figure (lmrob_times.pdf): elapsed time vs. number of predictors (left) and number of observations (right), both on log-log scales, with line type distinguishing the default and KS2014 settings.]
