
spark dataframe format error when using saved model to predict on new data

I am able to train a model and save it (Train.scala). Now I want to use that trained model to predict on new data (Predict.scala).

I create a new VectorAssembler in my Predict.scala to featurize the new data. Should I reuse the same VectorAssembler from Train.scala in the Predict.scala file? I ask because I am seeing issues with the feature data type after transformation.

For example, when I read in the trained model and try to predict on the featurized new data, I get this error:

type mismatch;
[error]  found   : org.apache.spark.sql.DataFrame
[error]     (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
[error]  required: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] => org.apache.spark.sql.Dataset[?]
[error]     val predictions = model.transform(featureData)

Training code: Train.scala

    // assembler
    val assembler = new VectorAssembler()
      .setInputCols(feature_list)
      .setOutputCol("features")

    //read in train data
    val trainingData = spark
      .read
      .parquet(train_data_path)

    // generate training features
    val trainingFeatures = assembler.transform(trainingData)

    //define model
    val lightGBMClassifier = new LightGBMClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setIsUnbalance(true)
        .setMaxDepth(25)
        .setNumLeaves(31)
        .setNumIterations(100) 

    // fit model
    val lgbm = lightGBMClassifier.fit(trainingFeatures)

    //save model
    lgbm
      .write
      .overwrite()
      .save(my_model_s3_path)

Predict code: Predict.scala

val assembler = new VectorAssembler()
    .setInputCols(feature_list)
    .setOutputCol("features")

// load model
val model = spark.read.parquet(my_model_s3_path)

// load new data
val inputData = spark.read.parquet(new_data_path)

//Assembler to transform new data
val featureData = assembler.transform(inputData)

//predict on new data
val predictions = model.transform(featureData) // <- got error here

Should I be using a different method to read in my trained model or to transform my data?

"Should I use the same VectorAssembler in Train.scala for the Predict.scala file?" Yes; however, I would strongly recommend using Pipelines.

// Train.scala
import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline().setStages(Array(assembler, lightGBMClassifier))
val pipelineModel = pipeline.fit(trainingData)
pipelineModel.write.overwrite().save("/path/to/pipelineModel")

// Predict.scala
import org.apache.spark.ml.PipelineModel

val pipelineModel = PipelineModel.load("/path/to/pipelineModel")
val predictions = pipelineModel.transform(inputData)
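
As for the original compile error: spark.read.parquet returns a DataFrame, so your model variable is a DataFrame, and Dataset.transform expects a function Dataset[Row] => Dataset[?], not another DataFrame; that is exactly the type mismatch the compiler reports. If you do not want a Pipeline, load the classifier back with its own loader. A minimal sketch, assuming SynapseML's LightGBMClassificationModel (the package name varies by version; older MMLSpark releases use com.microsoft.ml.spark) and that my_model_s3_path is the directory saved by lgbm.write.save:

    // Predict.scala (without a Pipeline)
    import com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel

    // Load the classifier itself, not the underlying parquet files
    val model = LightGBMClassificationModel.load(my_model_s3_path)

    // The assembler must match the one used at training time
    val featureData = assembler.transform(inputData)
    val predictions = model.transform(featureData)

With a Pipeline you avoid having to keep the assembler and the model in sync by hand, which is why it is the recommended route.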

See if the issue goes away by simply using Pipelines, serializing/deserializing the model correctly, and structuring your code better. Also, make sure that trainingData and inputData both contain the same columns listed in feature_list.
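
One quick sanity check for that last point, assuming feature_list is an Array[String] as in your VectorAssembler setup:

    // Fail fast if the scoring data is missing any training feature column
    val missing = feature_list.filterNot(inputData.columns.contains)
    require(missing.isEmpty, s"inputData is missing feature columns: ${missing.mkString(", ")}")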
