Can the parallel or snow packages in R interface with a Spark cluster?

I am dealing with a computationally intensive package in R. This package has no alternative implementation that interfaces with a Spark cluster; however, it does have an optional argument to take in a cluster created with the parallel package. My question is: can I connect to a Spark cluster using something like sparklyr, and then use that Spark cluster as part of a makeCluster command to pass into my function?

I have successfully gotten the cluster working with parallel, but I do not know how, or whether it is even possible, to leverage the Spark cluster.

library(bnlearn)
library(parallel)

# local cluster of 3 workers created with the parallel package
my_cluster <- makeCluster(3)
...
# pc.stable accepts the cluster through its `cluster` argument
pc_structure <- pc.stable(train[,-1], cluster = my_cluster)

My question is: can I connect to a Spark cluster as follows:

sc <- spark_connect(master = "yarn-client", config = config, version = '1.6.2')

and then leverage the connection (the sc object) in the makeCluster() function?

If that would solve your problem (and if I understand you correctly), I'd wrap the code that uses the parallel package into a SparkR function, e.g. spark.lapply (or something similar in sparklyr; I don't have experience with that).

I assume your Spark cluster is Linux based, hence the mclapply function from the parallel package should be used (instead of makeCluster and the subsequent clusterExport needed on Windows).

For example, a locally executed task of summing up the numbers in each element of a list would be (on Linux):

library(parallel)
input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))
# fork-based parallelism: one worker per list element (not available on Windows)
res = mclapply(X=input, FUN=sum, mc.cores=3)
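
For comparison, here is a rough sketch (not from the original answer) of the makeCluster/clusterExport route mentioned above, which also works on Windows; the offset variable is only there to show why clusterExport is needed for globals referenced by the worker function:

library(parallel)
input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))

# socket-based cluster: portable to Windows, but any global objects used by
# the worker function must be exported to the workers explicitly
cl <- makeCluster(3)
offset <- 10
clusterExport(cl, "offset")
res <- parLapply(cl, input, function(x) sum(x) + offset)
stopCluster(cl)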

And doing the same task 10000 times using a Spark cluster:

library(SparkR)   # provides spark.lapply; assumes a SparkR session has already been initialised

# save the input where every worker node can read it
input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))
save(input, file="/path/testData.RData")

# distribute 10000 repetitions over the Spark cluster; each repetition
# loads the shared data and runs mclapply locally on its executor
res = spark.lapply(1:10000, function(x){
    library(parallel)
    load("/path/testData.RData")
    mclapply(X=input, FUN=sum, mc.cores=3)
})

The question is whether your code can be tweaked that way.
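
Since the question asks specifically about sparklyr, here is a minimal sketch of what the sparklyr analogue might look like, using spark_apply, which runs an R function on each partition of a Spark DataFrame. This is not from the original answer: the use of sdf_len to create the partitions, the partition count, and the shared file path are illustrative assumptions.

library(sparklyr)

# connect as in the question (config/version arguments omitted here for brevity)
sc <- spark_connect(master = "yarn-client")

tasks <- sdf_len(sc, 10000, repartition = 100)   # 10000 ids split into 100 partitions

# spark_apply invokes the function once per partition on the executors
res <- spark_apply(tasks, function(df) {
  library(parallel)                # base R package, available on the workers
  load("/path/testData.RData")     # must be readable from every worker node
  sums <- mclapply(X = input, FUN = sum, mc.cores = 3)
  data.frame(total = sum(unlist(sums)))   # spark_apply expects a data frame back
})

Collecting res (e.g. with dplyr's collect()) would then bring the per-partition totals back into the local R session. Note that this approach does not hand a Spark-backed cluster object to makeCluster(); it replaces the parallel-driven loop with a Spark-distributed one, which is why the code has to be restructured as described above.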
