
spark dataframe format error when using saved model to predict on new data

I am able to train a model and save it (Train.scala). Now I want to use that trained model to predict on new data (Predict.scala).

I create a new VectorAssembler in my Predict.scala to featurize the new data. Should I reuse the same VectorAssembler from Train.scala in the Predict.scala file? I ask because I am seeing issues with the feature data type after transformation.

For example, when I read in the trained model and try to predict on the featurized new data, I get this error:

type mismatch;
[error]  found   : org.apache.spark.sql.DataFrame
[error]     (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
[error]  required: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] => org.apache.spark.sql.Dataset[?]
[error]     val predictions = model.transform(featureData)

Training code: Train.scala

    // assembler
    val assembler = new VectorAssembler()
      .setInputCols(feature_list)
      .setOutputCol("features")

    //read in train data
    val trainingData = spark
      .read
      .parquet(train_data_path)

    // generate training features
    val trainingFeatures = assembler.transform(trainingData)

    //define model
    val lightGBMClassifier = new LightGBMClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setIsUnbalance(true)
        .setMaxDepth(25)
        .setNumLeaves(31)
        .setNumIterations(100) 

    // fit model
    val lgbm = lightGBMClassifier.fit(trainingFeatures)

    //save model
    lgbm
      .write
      .overwrite()
      .save(my_model_s3_path)

Predict code: Predict.scala

val assembler = new VectorAssembler()
    .setInputCols(feature_list)
    .setOutputCol("features")

// load model
val model = spark.read.parquet(my_model_s3_path)

// load new data
val inputData = spark.read.parquet(new_data_path)

//Assembler to transform new data
val featureData = assembler.transform(inputData)

//predict on new data
val predictions = model.transform(featureData) // <- got error here

Should I be using a different method to read in my trained model or to transform my data?

"Should I use the same VectorAssembler in Train.scala for the Predict.scala file?" Yes; however, I would strongly recommend using Pipelines.

// Train.scala
import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline().setStages(Array(assembler, lightGBMClassifier))
val pipelineModel = pipeline.fit(trainingData)
pipelineModel.write.overwrite().save("/path/to/pipelineModel")

// Predict.scala
import org.apache.spark.ml.PipelineModel

val pipelineModel = PipelineModel.load("/path/to/pipelineModel")
val predictions = pipelineModel.transform(inputData)
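
As for the original compile error: spark.read.parquet returns a DataFrame, so your model variable is a DataFrame, and Dataset.transform expects a function Dataset[Row] => Dataset[?], not another DataFrame; that is exactly the type mismatch the compiler reports. If you do not want a Pipeline, load the classifier back with its own loader. A minimal sketch, assuming SynapseML's LightGBMClassificationModel (the package name varies by version; older MMLSpark releases use com.microsoft.ml.spark) and that my_model_s3_path is the directory saved by lgbm.write.save:

    // Predict.scala (without a Pipeline)
    import com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassificationModel

    // Load the classifier itself, not the underlying parquet files
    val model = LightGBMClassificationModel.load(my_model_s3_path)

    // The assembler must match the one used at training time
    val featureData = assembler.transform(inputData)
    val predictions = model.transform(featureData)

With a Pipeline you avoid having to keep the assembler and the model in sync by hand, which is why it is the recommended route.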

See if the issue goes away by simply using Pipelines, serializing/deserializing the model correctly, and structuring your code better. Also, make sure that trainingData and inputData both contain the same columns listed in feature_list.
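
One quick sanity check for that last point, assuming feature_list is an Array[String] as in your VectorAssembler setup:

    // Fail fast if the scoring data is missing any training feature column
    val missing = feature_list.filterNot(inputData.columns.contains)
    require(missing.isEmpty, s"inputData is missing feature columns: ${missing.mkString(", ")}")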
