The code below will give you some idea of what I want to do. In reality, I'm working with imputed genetics files: about 100 million SNPs (variables) imputed for several thousand people. I want to run a separate regression on each individual variable. Any individual model is computationally trivial; the problem is that the files are giant and these models have to run 100 million times.
According to Microsoft, their HDInsight R Server is optimized for long data, so the task would be much easier if I had a thousand variables and 100 million observations.
So I would like to split my giant files into several pieces, for example splitting 1 dataset of a million SNPs into 10 datasets of 100,000 SNPs each.
Here is the code I want to run; the last line doesn't work. I need to know how to send each of these 10 smaller datasets to a different node and then run a common function on it. In general, I want to reproduce the mclapply() function, but instead of running it on multiple cores, run it on multiple worker nodes.
Typically the server works by automatically chopping the rows into several sections and distributing the task that way, which is a waste of resources when there are only a few thousand observations.
col <- 10000
row <- 500
df <- data.frame(matrix(rnorm(row * col), nrow = row))
caco <- sample(0:1, row, replace = TRUE)
# The way I would do it locally for a normal dataset
fun <- function(x) {
  var <- df[[x]]
  model <- summary(glm(caco ~ var, family = "binomial"))
  p <- c(x, coef(model)["var", "Pr(>|z|)"])
  return(p)
}
stuff <- names(df)
results <- lapply(stuff,fun)
# or, in parallel across local cores (requires the parallel package)
library(parallel)
results <- mclapply(stuff, fun)
### what I want to do
# Split into several data frames
# possibly to other data manipulation, whatever is necessary
df1 <- df[,1:2000]
df2 <- df[,2001:4000]
df3 <- df[,4001:6000]
df4 <- df[,6001:8000]
df5 <- df[,8001:10000]
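To avoid hard-coding the column ranges, the chunks could also be generated programmatically; a base-R sketch (`chunk_size` is a name I'm introducing here):

```r
# Same simulated data as above
col <- 10000
row <- 500
df <- data.frame(matrix(rnorm(row * col), nrow = row))

# Split the column names into groups of 2000, then subset df by group,
# giving a list of 5 chunk data frames instead of df1..df5 by hand.
chunk_size <- 2000
col_groups <- split(names(df), ceiling(seq_along(names(df)) / chunk_size))
df_chunks  <- lapply(col_groups, function(cols) df[, cols, drop = FALSE])
```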
# I want to send each worker node one of these datasets, so each runs 2000 models
# this code does not work -
# but I think this is the general direction I want to go, using the
# rxExec function
out <- rxExec(fun, rxElemArg(stuff), execObjects = c("df1","df2","df3","df4","df5"))
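If I read the docs correctly, rxElemArg() hands each worker exactly one element of a list, so it may be closer to pass a list of the chunk data frames rather than the column names, with one function that fits all models in its chunk. A hedged, untested sketch (`chunkFun` is a name I'm introducing; caco and df1..df5 are from the code above):

```r
# Per-chunk worker: fits one logistic model per column of its chunk and
# returns the variable name with its p-value (same logic as fun() above).
chunkFun <- function(chunk) {
  lapply(names(chunk), function(x) {
    var <- chunk[[x]]
    model <- summary(glm(caco ~ var, family = "binomial"))
    c(x, coef(model)["var", "Pr(>|z|)"])
  })
}

# On HDInsight, with RevoScaleR loaded and a distributed compute context
# set via rxSetComputeContext(), rxElemArg() would hand each worker ONE
# element of the list, so five workers each fit 2000 models (untested):
#   df_chunks <- list(df1, df2, df3, df4, df5)
#   out <- rxExec(chunkFun, rxElemArg(df_chunks), execObjects = "caco")
```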
Please see if the rxExec documentation can help here: https://msdn.microsoft.com/en-us/microsoft-r/scaler-distributed-computing#parallel-computing-with-rxexec. In particular, the section on plotting the Mandelbrot set demonstrates a similar case: https://msdn.microsoft.com/en-us/microsoft-r/scaler-distributed-computing#plotting-the-mandelbrot-set
For better runtime performance, you may want to read the input file directly inside the function passed to rxExec, rather than sharing the data through a data frame object.
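A minimal local sketch of that suggestion (file names and `fileFun` are illustrative; on the cluster the final lapply() would become `rxExec(fileFun, rxElemArg(files), execObjects = "caco")`, so only short path strings, not whole data frames, travel to the nodes):

```r
# Small self-contained demo data (stand-in for the real chunk files).
set.seed(1)
row <- 100
df_small <- data.frame(matrix(rnorm(row * 6), nrow = row))
caco <- sample(0:1, row, replace = TRUE)

# Write two illustrative chunk files of 3 variables each.
files <- c("chunk_1.csv", "chunk_2.csv")
write.csv(df_small[, 1:3], files[1], row.names = FALSE)
write.csv(df_small[, 4:6], files[2], row.names = FALSE)

# Worker reads its own chunk from disk, then fits one model per column.
fileFun <- function(path) {
  chunk <- read.csv(path)
  lapply(names(chunk), function(x) {
    var <- chunk[[x]]
    model <- summary(glm(caco ~ var, family = "binomial"))
    c(x, coef(model)["var", "Pr(>|z|)"])
  })
}

# Local stand-in; swap for rxExec(fileFun, rxElemArg(files), ...) on HDInsight.
out <- lapply(files, fileFun)
```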
Let me know (xiaoyzhu at microsoft dot com) if you have further questions.