How to save models from ML Pipeline to S3 or HDFS?

Question

I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:

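import java.io._
import org.apache.spark.ml.PipelineModel

// Serialize one fitted pipeline model to the local filesystem.
def saveModel(name: String, model: PipelineModel): Unit = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  try oos.writeObject(model) finally oos.close()
}

// Pair each school name with its fitted model and save them one by one.
schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name, model)
}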

I have tried using s3://some/path/$name and /user/hadoop/some/path/$name, since I would like the models to end up in Amazon S3, but both fail with messages indicating that the path cannot be found.

How can I save the models to Amazon S3?

Answer 1

One way to save a model to HDFS is as follows:

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/linReg.model")

The saved model can then be loaded as:

// use the same path that was passed to saveAsObjectFile
val linRegModel = sc.objectFile[LinearRegressionModel]("hdfs:///user/root/linReg.model").first()

For more details, see (ref).
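As an alternative to wrapping the model in an RDD, the Java serialization from the question can be pointed at HDFS by opening the stream through the Hadoop FileSystem API instead of FileOutputStream. The following is a minimal sketch, not a tested recipe for every model type: the helper names saveToHadoopFs and loadFromHadoopFs are hypothetical, and it assumes hadoop-common is on the classpath (as it is in a Spark application).

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: Java-serializes obj to any URI that Hadoop can resolve.
def saveToHadoopFs(uri: String, obj: java.io.Serializable, conf: Configuration): Unit = {
  val path = new Path(uri)
  val fs = path.getFileSystem(conf) // filesystem is chosen from the URI scheme
  val oos = new ObjectOutputStream(fs.create(path, true)) // true = overwrite
  try oos.writeObject(obj) finally oos.close()
}

// Hypothetical helper: reads the object back through the same API.
def loadFromHadoopFs[T](uri: String, conf: Configuration): T = {
  val path = new Path(uri)
  val ois = new ObjectInputStream(path.getFileSystem(conf).open(path))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}
```

Inside a Spark job, the call might look like saveToHadoopFs(s"hdfs:///user/hadoop/some/path/$name", model, sc.hadoopConfiguration). An S3 URI is assumed to work the same way only if the corresponding Hadoop S3 connector and credentials are configured.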

Answer 2

Since Apache Spark 1.6, the Scala API lets you save your models without any tricks, because models from the ML library come with a save method. You can check this in LogisticRegressionModel, which does have that method. To load the model, you can use a static method:

val logRegModel = LogisticRegressionModel.load("myModel.model")
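For the PipelineModel from the question, the same save and load round trip can be sketched as below. This is an illustration under assumptions rather than the answer author's code: it assumes Spark 2.x or later (for SparkSession), the toy data and column names are made up, and a local temporary directory stands in for the real target.

```scala
import java.nio.file.Files
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Local session for illustration only; a cluster job would already have one.
val spark = SparkSession.builder().master("local[1]").appName("save-model-sketch").getOrCreate()
import spark.implicits._

// Toy training data; the column names are made up for this sketch.
val training = Seq((0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.5, 1.0), (0.1, 0.9, 0.0))
  .toDF("x1", "x2", "label")

val assembler = new VectorAssembler().setInputCols(Array("x1", "x2")).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(5)
val model: PipelineModel = new Pipeline().setStages(Array(assembler, lr)).fit(training)

// A local temp directory is used here; an hdfs:// or s3a:// URI is assumed to work
// the same way when the matching Hadoop filesystem connector is configured.
val path = Files.createTempDirectory("pipeline-model").resolve("school-a").toString
model.write.overwrite().save(path)

// Load the fitted pipeline back through the companion object.
val restored = PipelineModel.load(path)
```

Because the writer creates a directory of metadata and Parquet files rather than a single file, each of the thousands of models would need its own path, for example one per school name.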

Answer 3

FileOutputStream saves to the local filesystem (not through the Hadoop libraries), so with this approach the way to go is to save to a local directory. The directory needs to exist, so make sure it has been created first.
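A minimal sketch of that advice, using only the JDK: create the directory before opening the stream, then serialize into it. The helper names saveLocal and loadLocal are hypothetical, and a plain string stands in for a fitted model in the usage line.

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.file.{Files, Path}

// Hypothetical helper: creates the target directory if needed, then
// serializes any java.io.Serializable value (such as a fitted model) into it.
def saveLocal(dir: Path, name: String, obj: java.io.Serializable): Path = {
  Files.createDirectories(dir) // no-op when the directory already exists
  val target = dir.resolve(name)
  val oos = new ObjectOutputStream(new FileOutputStream(target.toFile))
  try oos.writeObject(obj) finally oos.close()
  target
}

// Hypothetical helper: reads the object back from a local file.
def loadLocal[T](file: Path): T = {
  val ois = new ObjectInputStream(new FileInputStream(file.toFile))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}
```

For example, saveLocal(java.nio.file.Paths.get("/some/path"), name, model) would replace the body of saveModel from the question. Note that on a cluster this writes to the local disk of whichever machine runs the code, typically the driver.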

Depending on your model, you may also wish to look at PMML export: https://spark.apache.org/docs/latest/mllib-pmml-model-export.html

Answer 1: 10, 2015-09-19 04:12:59
Answer 2: 4, 2016-02-01 19:17:00
Answer 3: 1, 2015-08-30 06:52:47