
Should I pre-install CRAN R packages on worker nodes when using SparkR?

I want to use CRAN R packages such as forecast with SparkR, and I have run into the following two problems.

  1. Should I pre-install all those packages on the worker nodes? When I read the source code of Spark (this file), it seems that Spark will automatically zip packages and distribute them to the workers via --jars or --packages. What should I do to make the dependencies available on the workers?

  2. Suppose I need to use functions provided by forecast inside a map transformation. How should I import the package? Do I need to do something like the following, importing the package inside the map function, and will that import it multiple times: SparkR:::map(rdd, function(x){ library(forecast); then do other stuff })

Update:

After reading more of the source code, it seems that I can use includePackage to include packages on worker nodes, according to this file. So now the question becomes: is it true that I have to pre-install the packages on the nodes manually? And if so, what is the use case for the --jars and --packages options described in question 1? If not, how do I use --jars and --packages to install the packages?

It is boring to repeat this, but you shouldn't use the internal RDD API in the first place. It was removed in the first official SparkR release and it is simply not suitable for general usage.

Until a new low-level API* is ready (see for example SPARK-12922, SPARK-12919, SPARK-12792), I wouldn't consider Spark a platform for running plain R code. Even when that changes, adding native (Java / Scala) code with R wrappers can be a better choice.

That being said, let's start with your questions:

  1. RPackageUtils is designed to handle packages created with Spark Packages in mind. It does not handle standard R libraries.
  2. Yes, you need the packages to be installed on every node. From the includePackage docstring (a minimal usage sketch follows the quote):

    The package is assumed to be installed on every node in the Spark cluster.
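
A minimal sketch of how includePackage was used with the old, now-internal RDD-level API, assuming forecast is already installed in every worker's R library; the sc and rdd objects are placeholders for your own SparkContext and RDD, and the exact includePackage signature is an assumption based on the pre-1.x SparkR docs:

# make library(forecast) available in closures run on the workers;
# this does NOT install the package, it must already be present there
SparkR:::includePackage(sc, forecast)
fits <- SparkR:::map(rdd, function(values) {
  # values is one RDD element; build a monthly time series and forecast it
  fit <- forecast::auto.arima(ts(values, frequency = 12))
  as.numeric(forecast::forecast(fit, h = 1)$mean)
})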


* If you use Spark 2.0+ you can use the dapply, gapply and lapply (spark.lapply) functions.
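
For example, a minimal spark.lapply sketch under the same assumption (Spark 2.0+, forecast already installed on every worker; the workload here is only illustrative):

library(SparkR)
sparkR.session()

# run four independent forecasts in parallel on the cluster;
# library(forecast) only succeeds if the package is installed on the workers
horizons <- 1:4
preds <- spark.lapply(horizons, function(h) {
  library(forecast)
  fit <- auto.arima(AirPassengers)          # AirPassengers ships with base R
  as.numeric(forecast(fit, h = h)$mean[h])  # point forecast h months ahead
})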

Adding libraries works with Spark 2.0+. For example, here I am adding the forecast package on all nodes of the cluster. The code works with Spark 2.0+ and the Databricks environment.

schema <- structType(structField("out", "string"))
out <- gapply(
  df,
  c("p", "q"),
  function(key, x) {
    # install forecast on the worker if it is not already loaded or available
    if (!all(c("forecast") %in% (.packages()))) {
      if (!require("forecast")) {
        install.packages("forecast", repos = "http://cran.us.r-project.org", INSTALL_opts = c("--no-lock"))
      }
    }
    # use forecast here, then return a data.frame that matches the schema
    data.frame(out = x$column, stringsAsFactors = FALSE)
  },
  schema)

A better choice is to pass your local R packages via the spark-submit archive option, which means you do not need to install the R packages on each worker, and you avoid the time-consuming install and compile step while SparkR::dapply is running. For example:

Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client --num-executors 40 --executor-cores 10 --executor-memory 8G --driver-memory 512M --jars /usr/lib/hadoop/lib/hadoop-lzo-0.4.15-cdh5.11.1.jar --files /etc/hive/conf/hive-site.xml --archives /your_R_packages/3.5.zip --files xgboost.model sparkr-shell")

When calling the SparkR::dapply function, have it call .libPaths("./3.5.zip/3.5") first. Also note that the R version on the server must match the R version your zip file was built with.
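
A minimal sketch of that pattern, assuming the archive shipped above is unpacked as ./3.5.zip/3.5 in each executor's working directory and contains forecast built with the same R version as the workers; df and the returned column are placeholders:

schema <- structType(structField("n", "integer"))
out <- dapply(
  df,
  function(x) {
    # point this worker's R library path at the unpacked --archives payload
    .libPaths(c("./3.5.zip/3.5", .libPaths()))
    library(forecast)  # resolved from the shipped library, no install or compile
    data.frame(n = nrow(x))
  },
  schema)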
