
Run an R Model using SparkR

Thanks in advance for your input. I am a newbie to ML. I've developed an R model (using RStudio on my local machine) and want to deploy it on a Hadoop cluster that has RStudio installed. I want to use SparkR to leverage high-performance computing. I just want to understand the role of SparkR here.

Will SparkR enable the R model to run the algorithm within Spark ML on the Hadoop cluster?

OR

Will SparkR enable only the data processing, while the ML algorithm still runs within the context of R on the Hadoop cluster?

Appreciate your input.

These are general questions, but they actually have a very simple and straightforward answer: no (to both); SparkR will do neither.

From the Overview section of the SparkR docs:

SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.

SparkR cannot even read native R models.

The idea behind using SparkR for ML tasks is that you develop your model specifically in SparkR (and if you try, you'll also discover that it is much more limited in comparison to the plethora of models available in R through the various packages).
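To make this concrete, here is a minimal sketch of what "developing the model in SparkR" means: the model is defined and fitted with SparkR's own MLlib wrappers (e.g. `spark.glm`), which run the algorithm inside Spark on the cluster; a model fitted locally with base-R `lm()`/`glm()` plays no part in it. This assumes a working Spark installation and uses the built-in `iris` data purely for illustration:

```r
library(SparkR)
sparkR.session(appName = "sparkr-ml-sketch")

# Convert a local R data frame into a distributed Spark DataFrame.
# Note: SparkR replaces "." in column names with "_" (Sepal.Length -> Sepal_Length).
df <- createDataFrame(iris)

# Fit the model with SparkR's MLlib wrapper, not with base R's glm():
# the computation happens in Spark, not in the local R process.
model <- spark.glm(df, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")
summary(model)

# Predictions also come back as a Spark DataFrame, not an R data frame.
preds <- predict(model, df)
head(preds)

sparkR.session.stop()
```

The key point of the sketch is that both the training data and the predictions live in Spark DataFrames throughout, which is exactly why tooling that expects R data frames (see the `confusionMatrix` example below) does not apply directly.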

Even conveniences like, say, confusionMatrix from the caret package are not available, since they operate on R dataframes and not on Spark ones (see this question & answer).

