How to create a Spark Dataset from an RDD

Question

I have an RDD[LabeledPoint] intended for use within a machine learning pipeline. How do I convert that RDD to a Dataset? Note that the newer spark.ml APIs require input in the Dataset format.

Answer 1

Here is an answer that goes through an extra step: the DataFrame. We use the SQLContext to create a DataFrame and then create a Dataset using the desired object type, in this case LabeledPoint:

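import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // brings the Encoder needed by .as[LabeledPoint] into scope
val pointsTrainDf = sqlContext.createDataFrame(training)
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]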

Update: Ever heard of a SparkSession? (Neither had I until now.)

So apparently the SparkSession is the Preferred Way (TM) in Spark 2.0.0 and going forward. Here is the updated code for the new (Spark) world order:

Spark 2.0.0+ approaches

Notice that in both of the approaches below (the simpler of which is credited to @zero323) we gain an important saving compared to the SQLContext approach: it is no longer necessary to create a DataFrame first.

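val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._  // Encoder for LabeledPoint, required by createDataset
val pointsTrainDs = sparkSession.createDataset(training)
// fit is the public training entry point for spark.ml estimators
val model = new LogisticRegression()
   .fit(pointsTrainDs)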

Second way for Spark 2.0.0+ (credit to @zero323)

val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._

val trainDs = training.toDS()
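
For anyone who wants to try both Spark 2.0.0+ routes end to end, here is a minimal self-contained sketch. It assumes the org.apache.spark.ml.feature.LabeledPoint class (the question does not say which LabeledPoint is in use) and a local master. The object name RddToDataset, the helper convert, the app name, and the sample points are all made up for illustration.

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

object RddToDataset {
  // Hypothetical helper: wraps the toDS() route so it can be reused.
  def convert(spark: SparkSession, rdd: RDD[LabeledPoint]): Dataset[LabeledPoint] = {
    import spark.implicits._ // supplies the implicit Encoder[LabeledPoint]
    rdd.toDS()
  }

  def main(args: Array[String]): Unit = {
    // Local session purely for illustration; master and appName are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-to-dataset")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the question's `training` RDD.
    val training: RDD[LabeledPoint] = spark.sparkContext.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
      LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0))
    ))

    // Route 1: implicit conversion on the RDD.
    val viaToDS: Dataset[LabeledPoint] = convert(spark, training)
    // Route 2: explicit call on the session.
    val viaCreate: Dataset[LabeledPoint] = spark.createDataset(training)

    viaToDS.printSchema()
    println(viaCreate.count())

    spark.stop()
  }
}
```

Both routes should yield the same typed Dataset. toDS() is just the shorter spelling once spark.implicits._ has been imported.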

Traditional Spark 1.X and earlier approach

val sqlContext = new SQLContext(sc)  // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache()
val test = splits(1)
val trainDs = training.toDS()

See also: How to store custom objects in Dataset? by the esteemed @zero323.

Answer 1 (18, accepted, 2016-05-29 19:05:18)
