
R Server on Microsoft Azure HDInsight - dealing with very wide data - rxExec?

The code below should give you an idea of what I want to do. In reality I'm working with imputed genetics files: about 100 million SNPs (variables) in total, imputed for several thousand people. I want to run a regression on each individual variable. Any single model is computationally trivial; the problem is that I'm working with giant files and running these models 100 million times.

According to Microsoft, their HDInsight R Server is optimized for long data. The task would be much easier if I had a thousand variables and 100 million observations.

So I would like to split my giant files into several pieces, for example splitting one dataset of a million SNPs into ten datasets of 100,000 SNPs each.

Here is the code I want to run; the last line doesn't work. I need to know how to send each of these ten smaller datasets to a different node and then run a common function on each. In general I want to reproduce the mclapply() function, but running on multiple worker nodes instead of multiple cores.

Typically the server works by automatically chopping the rows into several sections and distributing the task that way, which is a waste of resources when there are only a few thousand observations.

col <- 10000
row <- 500

df <- data.frame(matrix(rnorm(row*col),nrow=row))
caco <- sample(0:1, row, replace=T)



# The way I would do it locally for a normal dataset


fun <- function(x){
  var <- df[[x]]
  model <- summary(glm(caco ~ var, family="binomial"))
  p <- c(x,coef(model)["var","Pr(>|z|)"])
  return(p)
}

stuff <- names(df)
results <- lapply(stuff, fun)
# or in parallel across local cores (requires the parallel package)
results <- parallel::mclapply(stuff, fun)



### what I want to do

# Split into several data frames
# possibly to other data manipulation, whatever is necessary

df1 <- df[,1:2000]
df2 <- df[,2001:4000]
df3 <- df[,4001:6000]
df4 <- df[,6001:8000]
df5 <- df[,8001:10000]

# I want to send each worker node one of these datasets, so each runs 2000 models

# this code does not work - 
# but I think this is the general direction I want to go, using the 
# rxExec function

out <- rxExec(fun, rxElemArg(stuff), execObjects=c("df1","df2","df3","df4","df5"))
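One hedged sketch of the direction this could take, assuming Microsoft R Server's RevoScaleR package is loaded and a cluster compute context has already been set with rxSetComputeContext(): instead of passing the column names, pass the list of data-frame chunks itself through rxElemArg(), so each worker node receives one chunk and loops over its 2,000 columns locally. The chunk_fun name and the do.call() recombination are my own illustration, not from the rxExec documentation.

```r
# Sketch only: assumes RevoScaleR is available and rxSetComputeContext()
# has already been pointed at the cluster.
chunks <- list(df1, df2, df3, df4, df5)

# Worker function: receives one chunk and fits one model per column.
chunk_fun <- function(chunk) {
  lapply(names(chunk), function(x) {
    var <- chunk[[x]]
    model <- summary(glm(caco ~ var, family = "binomial"))
    c(x, coef(model)["var", "Pr(>|z|)"])
  })
}

# rxElemArg() sends one element of 'chunks' to each call, so each worker
# gets one data frame; execObjects ships the shared 'caco' vector by name.
out <- rxExec(chunk_fun, rxElemArg(chunks), execObjects = "caco")

# Flatten the per-node result lists back into one list of results.
results <- do.call(c, out)
```

This keeps the per-variable model loop on the worker (plain lapply over columns), and only the coarse chunk-level split is distributed.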

Please see if the rxExec documentation can help here: https://msdn.microsoft.com/en-us/microsoft-r/scaler-distributed-computing#parallel-computing-with-rxexec

Particularly this section, which demonstrates a similar case: https://msdn.microsoft.com/en-us/microsoft-r/scaler-distributed-computing#plotting-the-mandelbrot-set

For better runtime performance, you may want to manipulate the input file directly inside rxExec rather than sharing it through a data frame object.
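As a hedged illustration of that suggestion (the chunk file names and CSV format are hypothetical, not from the question): pre-split the input into one file per chunk, pass the paths through rxElemArg(), and let each worker read its own piece, so the data frames never have to be serialized from the head node to the workers.

```r
# Sketch only: assumes the giant file has been pre-split into CSV chunks
# with these (hypothetical) names, and that RevoScaleR is loaded.
chunk_files <- paste0("snp_chunk_", 1:5, ".csv")

# Worker function: reads its own file, then fits one model per column.
file_fun <- function(path) {
  chunk <- read.csv(path)
  lapply(names(chunk), function(x) {
    var <- chunk[[x]]
    model <- summary(glm(caco ~ var, family = "binomial"))
    c(x, coef(model)["var", "Pr(>|z|)"])
  })
}

# One rxExec call per file path; only the small 'caco' vector is shipped.
out <- rxExec(file_fun, rxElemArg(chunk_files), execObjects = "caco")
```

The design choice here is that only file paths and the outcome vector cross the network; the wide genotype data stays on shared storage that each node reads directly.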

Let me know (xiaoyzhu at microsoft dot com) if you have further questions.
