
R Server on Microsoft Azure HDInsight - dealing with very wide data - rxExec?

The code below gives some idea of what I want to do. In reality, I'm working with imputed genetics files: about 100 million SNPs (variables) imputed for several thousand people. I want to run a regression on each individual variable. Any individual model is computationally trivial; the problem is that I'm working with giant files and running these models 100 million times.

According to Microsoft, their HDInsight R Server is optimized for long data. The task would be much easier if I had a thousand variables and 100 million observations.

So I would like to split my giant files into several pieces: for example, split one dataset of a million SNPs into ten datasets of 100,000 SNPs each.

Here is the code I want to run; the last line doesn't work. I need to know how to send each of these ten smaller datasets to a different node and then run a common function on each. In general, I want to reproduce the behavior of mclapply(), but instead of running it on multiple cores, run it on multiple worker nodes.

Typically the server works by automatically chopping the rows into several sections and distributing the task that way, which is a waste of resources when there are only a few thousand observations.

col <- 10000   # number of SNPs (variables)
row <- 500     # number of people (observations)

df <- data.frame(matrix(rnorm(row * col), nrow = row))
caco <- sample(0:1, row, replace = TRUE)   # case/control status



# The way I would do it locally for a normal dataset


library(parallel)  # for mclapply()

# Fit one logistic regression per variable and return (name, p-value)
fun <- function(x){
  var <- df[[x]]
  model <- summary(glm(caco ~ var, family = "binomial"))
  p <- c(x, coef(model)["var", "Pr(>|z|)"])
  return(p)
}

stuff <- names(df)
results <- lapply(stuff, fun)
# or, in parallel across local cores
results <- mclapply(stuff, fun)
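
To collect the per-variable output into a single table afterwards, one option is:

# Bind the list of (name, p-value) pairs into one matrix
res <- do.call(rbind, results)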



### what I want to do

# Split into several data frames
# possibly do other data manipulation, whatever is necessary

df1 <- df[,1:2000]
df2 <- df[,2001:4000]
df3 <- df[,4001:6000]
df4 <- df[,6001:8000]
df5 <- df[,8001:10000]
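
With more pieces, the same split could also be done programmatically; a small sketch (n_chunks and idx are just illustrative names):

# Split the columns of df into n_chunks roughly equal blocks
n_chunks <- 5
idx <- split(seq_len(ncol(df)), cut(seq_len(ncol(df)), n_chunks, labels = FALSE))
chunks <- lapply(idx, function(i) df[, i, drop = FALSE])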

# I want to send each worker node one of these datasets, so each runs 2000 models

# this code does not work -
# but I think this is the general direction I want to go, using the
# rxExec function

out <- rxExec(fun, rxElemArg(stuff), execObjects = c("df1","df2","df3","df4","df5"))

Please see if the rxExec documentation can help here: https://msdn.microsoft.com/en-us/microsoft-r/scaler-distributed-computing#parallel-computing-with-rxexec

In particular, this section demonstrates a similar case: https://msdn.microsoft.com/en-us/microsoft-r/scaler-distributed-computing#plotting-the-mandelbrot-set
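
The general pattern is to make the data chunk an explicit argument of the function and distribute a list of chunks with rxElemArg, so that element i of the list goes to run i. A minimal sketch, assuming RevoScaleR is loaded and a distributed compute context (e.g. RxSpark) has already been set with rxSetComputeContext; fun2 is an illustrative name:

library(RevoScaleR)

# One list element per worker: each element is one block of 2000 SNPs
chunks <- list(df1, df2, df3, df4, df5)

# The chunk and the outcome are passed in explicitly, so nothing has to be
# looked up in the global environment on the worker
fun2 <- function(chunk, caco) {
  sapply(names(chunk), function(x) {
    model <- summary(glm(caco ~ chunk[[x]], family = "binomial"))
    coef(model)[2, "Pr(>|z|)"]   # p-value of the SNP term
  })
}

# rxElemArg() distributes 'chunks' element-wise; 'caco' is sent to every run
out <- rxExec(fun2, chunk = rxElemArg(chunks), caco = caco)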

For better runtime performance, you may want to have each run read its input file directly inside rxExec rather than sharing the data through in-memory data frame objects.
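
For example, a sketch under the assumption that each column block has been written to its own file beforehand (the chunk file names and fun_file are illustrative, and the paths must be visible to the workers, e.g. on shared or HDFS storage):

# Write each column block to its own file up front
paths <- sprintf("chunk_%02d.csv", 1:5)
Map(function(d, p) write.csv(d, p, row.names = FALSE),
    list(df1, df2, df3, df4, df5), paths)

# Each worker reads only its own file
fun_file <- function(path, caco) {
  chunk <- read.csv(path)
  sapply(names(chunk), function(x) {
    coef(summary(glm(caco ~ chunk[[x]], family = "binomial")))[2, "Pr(>|z|)"]
  })
}

out <- rxExec(fun_file, path = rxElemArg(paths), caco = caco)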

Let me know (xiaoyzhu at microsoft dot com) if you have further questions.
