How to use the bigstatsr R package using two datasets for estimating the parameters?

I have two datasets, one of dependent variables and one of independent variables, and I want to test every possible pairing between a dependent and an independent variable. In my previous post ( How to replicate a function using mapply with multiple arguments to calculate the power of a method? ), I did a power analysis on simulated data. Now I want to analyze real data with the same function. The problem is that test_function takes too long because my datasets are big (each has dimensions greater than 10000 x 40000), so I want to use parallel computing to speed up the calculation. I found that the bigstatsr package ( https://privefl.github.io/bigstatsr/index.html ) can handle matrices that are too large to fit in memory. I also want to avoid expand.grid, as it is computationally expensive for big data. I did not find any post that uses two datasets simultaneously with the bigstatsr package and estimates the parameters in parallel. Example datasets and code are given below:


# dependent dataset (50 samples x 10 variables)
test_A <- data.frame(matrix(rnorm(500), nrow = 50, ncol = 10))
# independent dataset (the same 50 samples x 10 variables)
test_B <- data.frame(matrix(sample(c(0, 1, 2), 500, replace = TRUE), nrow = 50, ncol = 10))
# Find all combinations of variables from the dependent and independent datasets
A_B_pair <- expand.grid(c1 = names(test_A), c2 = names(test_B),
                        stringsAsFactors = FALSE)
# Main function to estimate the slope and its p-value for one pair of variables
test_function <- function(test_A, test_B, x, y) {
  Data <- data.frame(XX = test_A[[x]], YY = test_B[[y]])

  model_lm <- lm(YY ~ XX, data = Data)
  # Row 2 of the coefficient table is the slope for XX; column 4 is its p-value
  est_lm <- unname(coef(model_lm)[2])
  pvalue_lm <- summary(model_lm)$coefficients[2, 4]

  c(lm.estimator = est_lm, lm.pvalue = pvalue_lm)
}
# Final output
output <- mapply(test_function, MoreArgs = list(test_A, test_B),
                 x = A_B_pair$c1, y = A_B_pair$c2)

How can I use bigstatsr to run this function in parallel and get the outputs? Thank you so much for your effort and help.

I don't think there is really a problem of size here (memory-wise), but just a computation time problem.

I think you just want to do some univariate testing. For that, you can use the function big_univLinReg:

library(bigstatsr)
X <- as_FBM(test_B)
NCORES <- nb_cores()

k <- 1  ## replace by loop here
stats <- big_univLinReg(X, test_A[[k]], ncores = NCORES)
pval <- predict(stats, log10 = FALSE)

This function should be quite fast, and it gives you the coefficients for all variables in test_B at once. Then you only need to loop over the variables in test_A.
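The loop over the variables in test_A could look roughly like the sketch below. It is only an illustration under a few assumptions: the toy data are regenerated here with matching row counts (both tables must describe the same samples), NCORES is set to 1 for the toy example (nb_cores() would be the natural choice on real data), and the output column names c1, c2, lm.estimator and lm.pvalue are borrowed from the question rather than produced by bigstatsr. Note that this regresses each test_A column on each test_B column; test_function in the question fits the reverse direction, which gives the same p-values for a simple linear regression but different slope estimates.

```r
library(bigstatsr)

set.seed(1)
n <- 50
# Toy data: both tables must have the same number of rows (samples)
test_A <- data.frame(matrix(rnorm(n * 10), nrow = n, ncol = 10))
test_B <- data.frame(matrix(sample(c(0, 1, 2), n * 10, replace = TRUE),
                            nrow = n, ncol = 10))

# Store the independent variables once as a file-backed matrix
X <- as_FBM(as.matrix(test_B))
NCORES <- 1  # assumption for the toy example; try nb_cores() on real data

# One call per dependent variable; each call tests all columns of X
res <- lapply(names(test_A), function(a) {
  stats <- big_univLinReg(X, test_A[[a]], ncores = NCORES)
  data.frame(c1 = a,
             c2 = names(test_B),
             lm.estimator = stats$estim,
             lm.pvalue = predict(stats, log10 = FALSE))
})
output <- do.call(rbind, res)
head(output)
```

This avoids building the full expand.grid table up front: each iteration produces the results for one test_A variable against every test_B variable, so the results can also be written to disk per iteration if the combined table would be too large to keep in memory.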
