Can the parallel or snow packages in R interface with a Spark cluster?
I am dealing with a computationally intensive package in R. This package has no alternative implementation that interfaces with a Spark cluster; however, it does have an optional argument that accepts a cluster created with the parallel package. My question is: can I connect to a Spark cluster using something like sparklyr, and then use that Spark cluster as part of a makeCluster command to pass into my function?
I have successfully gotten the cluster working with parallel, but I do not know how, or whether it is even possible, to leverage the Spark cluster.
library(bnlearn)
library(parallel)
my_cluster <- makeCluster(3)
...
pc_structure <- pc.stable(train[,-1], cluster = my_cluster)
My question is: can I connect to a Spark cluster as follows:
library(sparklyr)
sc <- spark_connect(master = "yarn-client", config = config, version = '1.6.2')
and then leverage the connection (the sc object) in the makeCluster() function?
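Roughly speaking, this is what I am hoping for (purely illustrative; whether makeCluster can accept a Spark connection at all is exactly what I am asking):

# hypothetical -- I do not know if makeCluster can take a Spark connection
my_cluster <- makeCluster(sc)
pc_structure <- pc.stable(train[,-1], cluster = my_cluster)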
If that would solve your problem (and if I understand you correctly), I'd wrap the code that uses the parallel package into a SparkR function, e.g. spark.lapply (or something similar in sparklyr; I don't have experience with that).
I assume your Spark cluster is Linux based, hence the mclapply function from the parallel package should be used (instead of makeCluster and the subsequent clusterExport that you would use on Windows).
For example, a locally executed task of summing up the numbers in each element of a list would be (on Linux):
library(parallel)
input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))
res = mclapply(X=input, FUN=sum, mc.cores=3)
and doing the same task 10000 times using a Spark cluster:
library(SparkR)   # spark.lapply comes from SparkR and needs an active Spark session

input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))
save(input, file="/path/testData.RData")
res = spark.lapply(1:10000, function(x){
  library(parallel)
  load("/path/testData.RData")     # make the data available on the worker
  mclapply(X=input, FUN=sum, mc.cores=3)
})
The question is whether your code can be tweaked that way.
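For instance, purely as a sketch of the same pattern applied to your bnlearn call (the file path, the number of repetitions, and the idea that you would even want to run pc.stable many times, e.g. for bootstrapping, are all assumptions on my part; pc.stable needs a cluster object, so makeCluster is kept inside each Spark worker):

library(SparkR)
save(train, file="/path/train.RData")      # hypothetical path, as above
res = spark.lapply(1:100, function(i){
  library(bnlearn)
  library(parallel)
  load("/path/train.RData")                # load the data on the worker
  cl = makeCluster(3)                      # small local cluster per Spark worker
  fit = pc.stable(train[,-1], cluster = cl)
  stopCluster(cl)
  fit
})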