Can the parallel or snow packages in R interface with a Spark cluster?

I am dealing with a computationally intensive package in R. This package has no alternative implementation that interfaces with a Spark cluster; however, it does have an optional argument to take in a cluster created with the parallel package. My question is: can I connect to a Spark cluster using something like sparklyr, and then use that Spark cluster as part of a makeCluster command to pass into my function?

I have successfully gotten the cluster working with parallel, but I do not know how, or whether it is even possible, to leverage the Spark cluster.

library(bnlearn)
library(parallel)

# local cluster of 3 workers created with the parallel package
my_cluster <- makeCluster(3)
...
# pc.stable accepts the cluster through its `cluster` argument
pc_structure <- pc.stable(train[,-1], cluster = my_cluster)

My question is: can I connect to a Spark cluster as follows:

sc <- spark_connect(master = "yarn-client", config = config, version = '1.6.2')

and then leverage the connection (the sc object) in the makeCluster() function?

If that would solve your problem (and if I understand you correctly), I'd wrap the code that uses the parallel package into a SparkR function, e.g. spark.lapply (or something similar in sparklyr; I don't have experience with that).

I assume your Spark cluster is Linux based, hence the mclapply function from the parallel package should be used (instead of makeCluster and the subsequent clusterExport needed on Windows).

For example, a locally executed task of summing up the numbers in each element of a list would be (on Linux):

library(parallel)
input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))
# fork-based parallelism: one worker per list element (not available on Windows)
res = mclapply(X=input, FUN=sum, mc.cores=3)
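
For comparison, here is a rough sketch (not from the original answer) of the makeCluster/clusterExport route mentioned above, which also works on Windows; the offset variable is only there to show why clusterExport is needed for globals referenced by the worker function:

library(parallel)
input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))

# socket-based cluster: portable to Windows, but any global objects used by
# the worker function must be exported to the workers explicitly
cl <- makeCluster(3)
offset <- 10
clusterExport(cl, "offset")
res <- parLapply(cl, input, function(x) sum(x) + offset)
stopCluster(cl)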

And doing the same task 10000 times using a Spark cluster:

library(SparkR)   # provides spark.lapply; assumes a SparkR session has already been initialised

# save the input where every worker node can read it
input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))
save(input, file="/path/testData.RData")

# distribute 10000 repetitions over the Spark cluster; each repetition
# loads the shared data and runs mclapply locally on its executor
res = spark.lapply(1:10000, function(x){
    library(parallel)
    load("/path/testData.RData")
    mclapply(X=input, FUN=sum, mc.cores=3)
})

The question is whether your code can be tweaked that way.
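
Since the question asks specifically about sparklyr, here is a minimal sketch of what the sparklyr analogue might look like, using spark_apply, which runs an R function on each partition of a Spark DataFrame. This is not from the original answer: the use of sdf_len to create the partitions, the partition count, and the shared file path are illustrative assumptions.

library(sparklyr)

# connect as in the question (config/version arguments omitted here for brevity)
sc <- spark_connect(master = "yarn-client")

tasks <- sdf_len(sc, 10000, repartition = 100)   # 10000 ids split into 100 partitions

# spark_apply invokes the function once per partition on the executors
res <- spark_apply(tasks, function(df) {
  library(parallel)                # base R package, available on the workers
  load("/path/testData.RData")     # must be readable from every worker node
  sums <- mclapply(X = input, FUN = sum, mc.cores = 3)
  data.frame(total = sum(unlist(sums)))   # spark_apply expects a data frame back
})

Collecting res (e.g. with dplyr's collect()) would then bring the per-partition totals back into the local R session. Note that this approach does not hand a Spark-backed cluster object to makeCluster(); it replaces the parallel-driven loop with a Spark-distributed one, which is why the code has to be restructured as described above.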
