
Estimating an OLS model in R with ~1 million observations and thousands of variables

I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.

I am planning to run the estimation in chunks of approximately 100 observations each. I tested this strategy with a small sample and it worked fine.

However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.

I've already tried:

  • starting R with --max-ppsize=50000

  • setting options(expressions = 50000)

but the error persists.

I am working on Windows and using RStudio.

# create the sample data frame (in my real case, I simply select 100 rows from the original data, which contains ~1,000,000 rows)
DF <- data.frame(matrix(nrow = 100, ncol = 50000))
DF[,] <- rnorm(100 * 50000)
colnames(DF) <- c("y", paste0("x", seq_len(49999)))

# get names of covariates
my_xvars <- colnames(DF)[2:ncol(DF)]

# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR:
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))

EDIT 1: The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables. Therefore, simplifying the model by selecting fewer variables is not the solution I am looking for right now.

The first bottleneck (I can't guarantee there won't be others) is in the construction of the formula. R can't construct a formula that long from text (the details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and response variable y directly, rather than using a formula to build them. However, the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of size choose(nc,2) = nc*(nc-1)/2 (where nc is the number of columns). When I try with 50,000 columns I get

Error: cannot allocate vector of size 9.3 Gb

(2.3 Gb are required when nc is 25000). The example code further down runs on my laptop when nc <- 10000.
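
These sizes are just choose(nc, 2) doubles at 8 bytes each, which you can check directly:

nc <- 50000
choose(nc, 2) * 8 / 1024^3   # ~9.3 GiB
nc <- 25000
choose(nc, 2) * 8 / 1024^3   # ~2.3 GiB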

I have a few caveats about this approach:

  • you won't be able to handle a problem with 50,000 columns unless you have at least 10 GB of memory, because of the issue described above.
  • the biglm:::update.biglm method will have to be modified in a parallel way, i.e. hacked similarly to accept y and X directly (this shouldn't be too hard); see the sketch after this list
  • I have no idea if the p>>n issue (which applies at the level of fitting the initial chunk) will bite you. When running my example below (with 10 rows, 10000 columns), all but 10 of the parameters are NA. I don't know if these NA values will contaminate the results so that successive updating fails. If so, I don't know if there's a way to work around the problem, or if it's fundamental (so that you would need nr>nc for at least the initial fit). (It would be straightforward to do some small experiments to see if there is a problem, but I've already spent too long on this ...)
  • don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g. X <- cbind(1, X)) if you want one.
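
For the second caveat, here is a minimal sketch of what such a parallel hack might look like, reusing the same unexported biglm helpers as my_biglm() below; the name my_update_biglm and its signature are made up for illustration, and the sandwich bookkeeping is omitted:

my_update_biglm <- function(object, y, X, w = NULL) {
    ## fold a new chunk's model matrix and response into the stored QR object
    object$qr <- biglm:::update.bigqr(object$qr, X, y, w)
    object$n <- object$n + NROW(X)
    object$df.resid <- object$n - length(object$qr$D)
    object
}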

Example (first save the code at the bottom as my_biglm.R ):

nr <- 10
nc <- 10000
DF <- data.frame(matrix(rnorm(nr*nc),nrow=nr))

respvars <- paste0("x", seq(nc-1))
names(DF) <- c("y", respvars)

# illustrate formula problem: fails somewhere in 15000 < nc < 20000
try(reformulate(respvars,response="y"))

source("my_biglm.R")
rr <- my_biglm(y=DF[,1],X=as.matrix(DF[,-1]))
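
The returned object has class "biglm", so coef() should work on it; with only nr = 10 rows, expect nearly all coefficients to be NA, as discussed in the caveats above:

cf <- coef(rr)
sum(!is.na(cf))   # roughly 10 non-NA coefficients in this p >> n example
head(cf)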

my_biglm <- function (formula, data, weights = NULL, sandwich = FALSE,
                      y=NULL, X=NULL, off=0) {
    if (!is.null(weights)) {
        if (!inherits(weights, "formula")) 
            stop("`weights' must be a formula")
        w <- model.frame(weights, data)[[1]]
    } else w <- NULL
    if (is.null(X)) {
        tt <- terms(formula)
        mf <- model.frame(tt, data)
        if (is.null(off <- model.offset(mf))) 
            off <- 0
        mm <- model.matrix(tt, mf)
        y <- model.response(mf) - off
    } else {
        ## model matrix specified directly
        if (is.null(y)) stop("both y and X must be specified")
        mm <- X
        tt <- NULL
    }
    ## incremental QR decomposition: bigqr.init() is where the big
    ## choose(nc, 2)-sized vector described above gets allocated
    qr <- biglm:::bigqr.init(NCOL(mm))
    qr <- biglm:::update.bigqr(qr, mm, y, w)
    rval <- list(call = sys.call(), qr = qr, assign = attr(mm, 
        "assign"), terms = tt, n = NROW(mm), names = colnames(mm), 
        weights = weights)
    if (sandwich) {
        p <- ncol(mm)
        n <- nrow(mm)
        ## bigqr.init() and update.bigqr() are not exported, so call them via :::
        xyqr <- biglm:::bigqr.init(p * (p + 1))
        xx <- matrix(nrow = n, ncol = p * (p + 1))
        xx[, 1:p] <- mm * y
        for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
        xyqr <- biglm:::update.bigqr(xyqr, xx, rep(0, n), w * w)
        rval$sandwich <- list(xy = xyqr)
    }
    rval$df.resid <- rval$n - length(qr$D)
    class(rval) <- "biglm"
    rval
}
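
To connect this back to the original chunking plan, a rough sketch of feeding the full data through in chunks of ~100 rows, assuming a my_update_biglm() along the lines sketched above (bigy and bigX are placeholders for the full response vector and model matrix, and bigX would need an explicit intercept column, e.g. bigX <- cbind(1, bigX), if you want one):

chunk_size <- 100
chunks <- split(seq_along(bigy), ceiling(seq_along(bigy) / chunk_size))
fit <- my_biglm(y = bigy[chunks[[1]]], X = bigX[chunks[[1]], , drop = FALSE])
for (i in chunks[-1]) {
    fit <- my_update_biglm(fit, y = bigy[i], X = bigX[i, , drop = FALSE])
}
coef(fit)

Whether the NA coefficients from a p >> n initial chunk derail the later updates is exactly the open question raised in the caveats.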
