
Should I pre-install CRAN R packages on worker nodes when using SparkR?

I want to use CRAN R packages such as forecast with SparkR, and I have run into the following two problems.

  1. Should I pre-install all those packages on the worker nodes? When I read the source code of Spark (this file), it seems that Spark will automatically zip packages and distribute them to the workers via --jars or --packages. What should I do to make the dependencies available on the workers?

  2. Suppose I need to use functions provided by forecast inside a map transformation. How should I import the package? Do I need to do something like the following, importing the package inside the map function, and will that import it multiple times: SparkR:::map(rdd, function(x){ library(forecast); then do other stuff })

Update:

After reading more of the source code, it seems that I can use includePackage to include packages on worker nodes, according to this file. So now the question becomes: is it true that I have to pre-install the packages on the nodes manually? And if so, what is the use case for the --jars and --packages options described in question 1? If not, how do I use --jars and --packages to install the packages?

It is boring to repeat this, but you shouldn't use the internal RDD API in the first place. It was removed in the first official SparkR release and it is simply not suitable for general usage.

Until a new low-level API* is ready (see for example SPARK-12922, SPARK-12919, SPARK-12792), I wouldn't consider Spark a platform for running plain R code. Even when that changes, adding native (Java / Scala) code with R wrappers can be a better choice.

That being said, let's start with your questions:

  1. RPackageUtils is designed to handle packages created with Spark Packages in mind. It does not handle standard R libraries.
  2. Yes, you need the packages to be installed on every node. From the includePackage docstring (a minimal usage sketch follows the quote):

    The package is assumed to be installed on every node in the Spark cluster.
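
A minimal sketch of how includePackage was used with the old, now-internal RDD-level API, assuming forecast is already installed in every worker's R library; the sc and rdd objects are placeholders for your own SparkContext and RDD, and the exact includePackage signature is an assumption based on the pre-1.x SparkR docs:

# make library(forecast) available in closures run on the workers;
# this does NOT install the package, it must already be present there
SparkR:::includePackage(sc, forecast)
fits <- SparkR:::map(rdd, function(values) {
  # values is one RDD element; build a monthly time series and forecast it
  fit <- forecast::auto.arima(ts(values, frequency = 12))
  as.numeric(forecast::forecast(fit, h = 1)$mean)
})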


* If you use Spark 2.0+ you can use the dapply, gapply and lapply (spark.lapply) functions.
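
For example, a minimal spark.lapply sketch under the same assumption (Spark 2.0+, forecast already installed on every worker; the workload here is only illustrative):

library(SparkR)
sparkR.session()

# run four independent forecasts in parallel on the cluster;
# library(forecast) only succeeds if the package is installed on the workers
horizons <- 1:4
preds <- spark.lapply(horizons, function(h) {
  library(forecast)
  fit <- auto.arima(AirPassengers)          # AirPassengers ships with base R
  as.numeric(forecast(fit, h = h)$mean[h])  # point forecast h months ahead
})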

Adding libraries works with Spark 2.0+. For example, here I am adding the forecast package on all nodes of the cluster. The code works with Spark 2.0+ and the Databricks environment.

schema <- structType(structField("out", "string"))
out <- gapply(
  df,
  c("p", "q"),
  function(key, x) {
    # install forecast on the worker if it is not already loaded or available
    if (!all(c("forecast") %in% (.packages()))) {
      if (!require("forecast")) {
        install.packages("forecast", repos = "http://cran.us.r-project.org", INSTALL_opts = c("--no-lock"))
      }
    }
    # use forecast here, then return a data.frame that matches the schema
    data.frame(out = x$column, stringsAsFactors = FALSE)
  },
  schema)

A better choice is to pass your local R packages via the spark-submit archive option, which means you do not need to install the R packages on each worker, and you avoid the time-consuming install and compile step while SparkR::dapply is running. For example:

Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client --num-executors 40 --executor-cores 10 --executor-memory 8G --driver-memory 512M --jars /usr/lib/hadoop/lib/hadoop-lzo-0.4.15-cdh5.11.1.jar --files /etc/hive/conf/hive-site.xml --archives /your_R_packages/3.5.zip --files xgboost.model sparkr-shell")

When calling the SparkR::dapply function, have it call .libPaths("./3.5.zip/3.5") first. Also note that the R version on the server must match the R version your zip file was built with.
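
A minimal sketch of that pattern, assuming the archive shipped above is unpacked as ./3.5.zip/3.5 in each executor's working directory and contains forecast built with the same R version as the workers; df and the returned column are placeholders:

schema <- structType(structField("n", "integer"))
out <- dapply(
  df,
  function(x) {
    # point this worker's R library path at the unpacked --archives payload
    .libPaths(c("./3.5.zip/3.5", .libPaths()))
    library(forecast)  # resolved from the shipped library, no install or compile
    data.frame(n = nrow(x))
  },
  schema)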
