MM Estimation in Robust Regression

I am working with different linear regression models in R. I used the DATASET, which has 21263 rows and 82 columns.

All of the regression models run in an acceptable amount of time except the MM-estimate regression using the R function lmrob.

I waited more than 10 hours for the first for loop (#Block A) to run, and it does not work. By "does not work", I mean that it might give me output after two days. I tried this code with a smaller DATASET, which has 9568 rows and 5 columns, and it runs in one minute.

I am using a standard laptop.

The steps of my analysis are as follows:

Load and scale the dataset, then use a k-fold split with k = 30, because I want to calculate the variance of each variable's coefficient across the k splits.

Could you please give me any guidance?

library(caret)       # createFolds()
library(robustbase)  # lmrob()
library(magrittr)    # %>%

wdbc = read.csv("train.csv") # critical_temp is the dependent variable
wdbcc = as.data.frame(scale(wdbc)) # scaling the variables
### k-folds split ###
set.seed(12345)
k = 30
folds <- createFolds(wdbcc$critical_temp, k = k, list = TRUE, returnTrain = TRUE)

############ Start of MM Regression Model #################
#Block A
lmrob = list()
for (i in 1:k) {
    lmrob[[i]] = lmrob(critical_temp~ ., 
                       data = wdbcc[folds[[i]],],setting="KS2014")
}

#Block B
lmrob_coef = list()
lmrob_coef_var = list()

for(j in 1:(lmrob[[1]]$coefficients %>% length())){

    for(i in 1:k){

        lmrob_coef[[i]] = lmrob[[i]]$coefficients[j] 
        lmrob_coef_var[[j]] = lmrob_coef %>% unlist() %>% var()
    }

}

#Block C
lmrob_var = unlist(lmrob_coef_var)
lmrob_df = cbind(coefficients = lmrob[[1]]$coefficients %>% names() %>% as.data.frame()
                 , variance = lmrob_var %>% as.data.frame()) 
colnames(lmrob_df) = c("coefficients", "variance_lmrob")
#Block D
lmrob_var_sum = sum(lmrob_var)
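
As an aside, Blocks B to D could probably be condensed by collecting the coefficients into a matrix and taking row-wise variances. The following is only a minimal sketch under stated assumptions: the data are simulated, the fold assignment is a hypothetical base-R stand-in for createFolds(), and lm() stands in for lmrob() so that the example runs quickly without extra packages.

```r
# Simulated stand-in data (an assumption, not the real data set).
set.seed(1)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# Hypothetical fold assignment: each of the k folds is held out once,
# which mirrors createFolds(..., returnTrain = TRUE).
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# One fit per training set; lm() is a stand-in for lmrob() here.
fits <- lapply(seq_len(k), function(i) lm(y ~ ., data = dat[fold_id != i, ]))

# Coefficient matrix: one row per coefficient, one column per fold.
coef_mat <- sapply(fits, coef)

# Variance of each coefficient across folds, and the sum of those variances.
coef_var <- apply(coef_mat, 1, var)
coef_df <- data.frame(coefficients = names(coef_var),
                      variance = unname(coef_var))
coef_var_sum <- sum(coef_var)
```

Swapping lm() for lmrob() in the lapply() call should leave the rest unchanged, since coef() works on both kinds of fit.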

Not an answer, but some code to help you test this for yourself. I didn't run lmrob() on the full dataset, but everything I show below suggests that one full realization of the model (all observations, all predictors) should run in about 10-20 minutes [on a 10-year-old macOS desktop machine], which would extrapolate to roughly 5 to 10 hours for 30-fold cross-validation. (It looks like the time scales a little worse than the square root of the number of observations, and nonlinearly even on the log scale with the number of predictors ...) You can try the code below to see if things are much slower on your machine, and to predict how long the whole problem should take. Other general suggestions:

  • is there a chance you're running out of memory? Memory constraints can make things run much slower
  • if the problem is just that things are too slow, you can easily parallelize across folds if you have access to multiple cores (probably don't do this on a laptop, you'll burn it up)
  • AWS and other cloud services can be very useful
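
Parallelizing across folds could look roughly like the sketch below, which uses the base parallel package. The data are simulated, and fit_fold, fold_id, and the choice of 2 workers are hypothetical; treat it as an illustration of the idea rather than a tuned setup.

```r
library(parallel)
library(robustbase)

# Simulated stand-in data (an assumption, not the real data set).
set.seed(1)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# Hypothetical fold assignment: fold i is left out of the i-th fit.
k <- 4
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fit on everything except fold i. The robustbase:: prefix lets the
# function run on workers without attaching the package there.
fit_fold <- function(i, dat, fold_id) {
  robustbase::lmrob(y ~ ., data = dat[fold_id != i, ])
}

cl <- makeCluster(2)  # 2 workers is an arbitrary choice
fits <- tryCatch(
  parLapply(cl, seq_len(k), fit_fold, dat = dat, fold_id = fold_id),
  finally = stopCluster(cl)  # always release the workers
)

# One column of coefficients per fold.
coef_mat <- sapply(fits, coef)
```

Each worker receives its own copy of the data, so memory use grows with the number of workers; that matters if memory is already tight.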

I set up a test function to record the time taken by lmrob() running on a random subset of predictors and observations from your data set.

Extract data, load packages:

unzip("superconduct.zip")
xx <- read.csv("train.csv")
library(robustbase)
library(ggplot2); theme_set(theme_bw())
library(cowplot)

Define a test function for timing lmrob runs for different numbers of observations and predictors:

nc <- ncol(xx)  ## response vble is last column, "critical_temp"
test <- function(nobs=1000,npred=10,seed=NULL, ...) {
    if (!is.null(seed)) set.seed(seed)
    dd <- xx[sample(nrow(xx),size=nobs),
             c(sample(nc-1,size=npred),nc)]
    tt <- system.time(fit <- lmrob(critical_temp ~ ., data=dd, ...))
    tt[c("user.self","sys.self","elapsed")]
}    

t0 <- test()

The minimal example here (1000 observations, 10 predictors) is very fast (0.2 seconds). This is the basic loop I ran:

res <- expand.grid(nobs=seq(1000,10000,by=1000), npred=seq(10,30,by=2))
res$user.self <- res$sys.self <- res$elapsed <- NA
for (i in seq(nrow(res))) {
    cat(res$nobs[i],res$npred[i],"\n")
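    ## assign by name: the timing columns of res are not in the order test() returns
    res[i,c("user.self","sys.self","elapsed")] <- test(res$nobs[i],res$npred[i],seed=101)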
}

(As you can see in the plot below, I did this again with larger numbers of observations and predictors and used rbind() to combine the results into a single data frame.) I also tried fitting linear models to predict the time taken to fit the full data set with all predictors. (Plotting [see below] suggests that the time is log-log-linear in the number of observations but nonlinear in the number of predictors ...)

m1 <- lm(log10(elapsed)~poly(log10(npred),2)*log10(nobs), data=resc)
pp <- predict(m1, newdata=data.frame(npred=ncol(xx)-1,nobs=nrow(xx)),
              interval="confidence")  
10^pp  ## convert from log10(predicted seconds) to seconds
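
To make the log-log idea concrete, here is a self-contained sketch on made-up timings. The power-law form and its constants (1e-6, 0.6, 2) are assumptions chosen for illustration, not measurements, and the model is a plain power law rather than the quadratic-in-log10(npred) model used above.

```r
# Synthetic timings (assumption): elapsed = 1e-6 * nobs^0.6 * npred^2
grid <- expand.grid(nobs = c(1000, 2000, 5000, 10000),
                    npred = c(10, 20, 30))
grid$elapsed <- 1e-6 * grid$nobs^0.6 * grid$npred^2

# On the log10 scale a power law is linear, so the slopes of this
# regression estimate the exponents.
fit <- lm(log10(elapsed) ~ log10(nobs) + log10(npred), data = grid)
exponents <- coef(fit)[c("log10(nobs)", "log10(npred)")]

# Extrapolate to the size of the full problem (21263 rows, 81 predictors)
# and convert back from log10(seconds) to seconds.
full <- data.frame(nobs = 21263, npred = 81)
pred_seconds <- 10^predict(fit, newdata = full)
```

With real timings the fit will be noisy, and extrapolating well beyond the measured range (here from 30 to 81 predictors) is where a misspecified model hurts most, which is presumably why the model above includes a curvature term.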

Test the full data set.

t_all <- test(nobs=nrow(xx),npred=ncol(xx)-1)

I then realized that you were using setting = "KS2014" (as suggested in the documentation) rather than the default: this is at least 5x slower, as suggested by the following comparison:

test(nobs=10000,npred=30)
test(nobs=10000,npred=30,setting = "KS2014")

I re-ran some of the stuff above with setting="KS2014". Making the prediction for the full data set suggested a run-time of about 700 seconds (CI from 300 to 2000 seconds), still nowhere near as slow as you're suggesting.

gg0 <- ggplot(resc2,aes(x=npred,y=elapsed,colour=nobs,linetype=setting))+
    geom_point()+geom_line(aes(group=interaction(nobs,setting)))+
    scale_x_log10()+scale_y_log10()
gg1 <- ggplot(resc2,aes(x=nobs,y=elapsed,colour=npred, linetype=setting))+
    geom_point()+geom_line(aes(group=interaction(npred,setting)))+
    scale_x_log10()+scale_y_log10()
plot_grid(gg0,gg1,nrow=1)
ggsave("lmrob_times.pdf")

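[Figure: lmrob elapsed time against number of predictors (left panel) and number of observations (right panel), log-log axes, by setting]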
