
Using Saved Spark Model to Evaluate New Data

I've successfully transformed my data into a LibSVM file and trained a decision tree model on it with Spark's MLlib package. I used the Scala code from the 1.6.2 documentation, changing only the filenames:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "variance"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case (v, p) => math.pow(v - p, 2) }.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression tree model:\n" + model.toDebugString)

// Save and load model
model.save(sc, "target/tmp/myDecisionTreeRegressionModel")
val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeRegressionModel")

The code correctly displays the model's MSE and the learned tree model. However, I'm stuck on how to take sameModel and use it to evaluate new data. For example, if the LibSVM file I used to train the model looks like this:

0 1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:9 16:19
0 1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:1 13:0 14:0 15:9 16:12
0 1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:6 16:7

How do I feed the trained model something like this and have it predict the label?

1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:9 16:19
1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:1 13:0 14:0 15:9 16:12
1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:6 16:7

EDIT (8/31/2017 3:56 PM, Eastern)

Per the suggestions below, I'm trying the predict function, but the code doesn't seem quite right:

val new_data = MLUtils.loadLibSVMFile(sc, "hdfs://.../new_data/*")

val labelsAndPredictions = new_data.map { point =>
  val prediction = sameModel.predict(point.features)
  (point.label, prediction)
}

labelsAndPredictions.take(10)

If I run this with a LibSVM file containing '1' as the label (I'm testing with ten new rows in the file), then they all come back as '1.0' in the labelsAndPredictions.take(10) output. If I give it a '0' label, they all come back as '0.0', so it doesn't seem like anything is being predicted properly.

The load method should return a model. Then call predict with either an RDD[Vector] or a single Vector.

  1. Load the raw data (as you did above, from a similar LibSVM file).
  2. Provide information about categorical features.
  3. For each point in that data, make a prediction by calling savedModel.predict(point.features).
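
For rows that carry no label at all, like the feature-only lines in the question, one option is to parse them by hand instead of going through MLUtils.loadLibSVMFile. The sketch below is only an illustration: UnlabeledLibSVM and parseFeatures are hypothetical names, not part of Spark, and the feature count of 16 in the commented usage is assumed from the sample rows.

```scala
// Hypothetical helper (not part of Spark): parses one label-less LibSVM line,
// e.g. "1:1.0 2:0.0 16:19", into zero-based indices and their values.
object UnlabeledLibSVM {
  def parseFeatures(line: String): (Array[Int], Array[Double]) = {
    val pairs = line.trim.split("\\s+").filter(_.nonEmpty).map { item =>
      val Array(index, value) = item.split(":")
      // LibSVM indices are one-based; MLlib sparse vectors are zero-based.
      (index.toInt - 1, value.toDouble)
    }
    (pairs.map(_._1), pairs.map(_._2))
  }
}

// In spark-shell, the parsed parts could then be handed to the loaded model.
// Assumption: 16 features per row, as in the sample data from the question.
//   import org.apache.spark.mllib.linalg.Vectors
//   val features = sc.textFile("hdfs://.../new_data/*").map { line =>
//     val (indices, values) = UnlabeledLibSVM.parseFeatures(line)
//     Vectors.sparse(16, indices, values)
//   }
//   val predictions = sameModel.predict(features)  // RDD[Double]
```

Only the predicted values come back from predict, so nothing in the result depends on a label column in the input.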

You can load an ML model from disk via Pipeline:

import org.apache.spark.ml._
val pipeline = Pipeline.read.load("sample-pipeline")

scala> val stageCount = pipeline.getStages.size
stageCount: Int = 0

val pipelineModel = PipelineModel.read.load("sample-model")

scala> pipelineModel.stages

After obtaining the pipeline, you can fit it and generate predictions on the dataset (a loaded PipelineModel can call transform directly, without refitting):

val model = pipeline.fit(dataset)
val predictions = model.transform(dataset)

You must also use a proper Evaluator, e.g. RegressionEvaluator. An Evaluator works on datasets that contain predictions:

import org.apache.spark.ml.evaluation.RegressionEvaluator
val regEval = new RegressionEvaluator
println(regEval.explainParams)
regEval.evaluate(predictions)
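
RegressionEvaluator reports root mean squared error by default. As a rough illustration of what that number means, the same quantity can be computed by hand from (label, prediction) pairs. RmseSketch is a hypothetical name used only for this sketch, and it works on a plain local Seq rather than a Spark dataset.

```scala
// Hypothetical helper for illustration only; not part of Spark.
object RmseSketch {
  // Root mean squared error over (label, prediction) pairs.
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (label, prediction) pair")
    val squaredErrors = pairs.map { case (label, prediction) =>
      math.pow(label - prediction, 2)
    }
    math.sqrt(squaredErrors.sum / pairs.size)
  }
}
```

This is the square root of the testMSE value computed in the question, so the two numbers can be compared directly.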

UPD: If you are dealing with HDFS, you can easily load and save the model.

One way to save a model to HDFS is as follows:

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/sample-model")

The saved model can then be loaded as:

val linRegModel = sc.objectFile[LinearRegressionModel]("hdfs:///user/root/sample-model").first()
linRegModel.predict(Vectors.dense(11.0, 2.0, 2.0, 1.0, 2200.0))

or, as in the example above, but with an HDFS path instead of a local file:

PipelineModel.read.load("hdfs:///user/root/sample-model")

Use HDFS to ship the file to a directory that all nodes in the cluster can see. Then load it in your code and predict.
