Should I pre-install CRAN R packages on worker nodes when using SparkR?
I want to use R packages from CRAN, such as forecast, with SparkR, and I have run into the following two problems.
1. Should I pre-install all those packages on the worker nodes? When I read the Spark source code (this file), it seems that Spark will automatically zip packages and distribute them to the workers via --jars or --packages. What should I do to make the dependencies available on the workers?
2. Suppose I need to use functions provided by forecast in a map transformation; how should I import the package? Do I need to do something like the following, i.e. import the package inside the map function, and will that result in multiple imports?

SparkR:::map(rdd, function(x) {
  library(forecast)
  # then do other stuff
})
Update:
After reading more of the source code, it seems that I can use includePackage to include packages on worker nodes, according to this file. So now the question becomes: is it true that I have to pre-install the packages on the nodes manually? If so, what is the use case for --jars and --packages described in question 1? And if not, how do I use --jars and --packages to install the packages?
It is tedious to repeat this, but you shouldn't use the internal RDD API in the first place. It was removed in the first official SparkR release and is simply not suitable for general usage.
Until the new low-level API* is ready (see for example SPARK-12922, SPARK-12919, SPARK-12792), I wouldn't consider Spark a platform for running plain R code. Even when that changes, adding native (Java / Scala) code with R wrappers can be a better choice.
That being said, let's start with your questions:
RPackageUtils is designed to handle packages created with Spark Packages in mind. It doesn't handle standard R libraries.

Yes, you need the packages to be installed on every node. From the includePackage docstring:

The package is assumed to be installed on every node in the Spark cluster.
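For reference, a minimal sketch of that legacy RDD-level API, assuming forecast has already been installed on every worker (includePackage only loads a package on the workers, it does not install it; the rdd name here is a placeholder):

SparkR:::includePackage(sc, "forecast")
out <- SparkR:::map(rdd, function(x) {
  # forecast is already attached on the worker at this point
  # ... use forecast functions on x ...
  x
})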
* If you use Spark 2.0+ you can use the dapply, gapply and spark.lapply functions.
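(A hedged illustration, not part of the original answer: spark.lapply runs a plain R function on the executors, which some users rely on to check for or install a CRAN package, although it does not guarantee that every node is covered.)

spark.lapply(1:8, function(i) {
  # install forecast from CRAN if it is missing on this executor
  if (!requireNamespace("forecast", quietly = TRUE)) {
    install.packages("forecast", repos = "http://cran.us.r-project.org")
  }
  as.character(packageVersion("forecast"))
})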
Adding libraries works with Spark 2.0+. For example, I am adding the forecast package on all nodes of the cluster. The code works with Spark 2.0+ and the Databricks environment.
schema <- structType(structField("out", "string"))
out <- gapply(
  df,
  c("p", "q"),
  function(key, x) {
    # install and load forecast on the worker if it is not already loaded
    if (!all(c("forecast") %in% (.packages()))) {
      if (!require("forecast")) {
        install.packages("forecast", repos = "http://cran.us.r-project.org",
                         INSTALL_opts = c("--no-lock"))
        library(forecast)
      }
    }
    # use forecast, then return a data frame matching the schema
    data.frame(out = x$column, stringsAsFactors = FALSE)
  },
  schema)
A better choice is to pass your local R packages via the spark-submit --archives option, which means you do not need to install the R packages on each worker, and you avoid the time-consuming wait of installing and compiling R packages while SparkR::dapply runs. For example:
Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client --num-executors 40 --executor-cores 10 --executor-memory 8G --driver-memory 512M --jars /usr/lib/hadoop/lib/hadoop-lzo-0.4.15-cdh5.11.1.jar --files /etc/hive/conf/hive-site.xml --archives /your_R_packages/3.5.zip --files xgboost.model sparkr-shell")
When you call the SparkR::dapply function, have it call .libPaths("./3.5.zip/3.5") first. Also note that the R version on the servers must match the R version used to build your zip file.
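A minimal sketch of how that might look inside dapply (assuming the 3.5.zip archive shipped via --archives above; df is a placeholder SparkDataFrame):

result <- dapply(
  df,
  function(x) {
    # point the worker's library path at the unpacked archive, then load packages
    .libPaths(c("./3.5.zip/3.5", .libPaths()))
    library(forecast)
    # ... use forecast on the partition x ...
    x
  },
  schema(df))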