spark dataframe format error when using saved model to predict on new data
I am able to train a model and save it (Train.scala). Now I want to use that trained model to predict on new data (Predict.scala).
I create a new VectorAssembler in my Predict.scala to featurize the new data. Should I use the same VectorAssembler from Train.scala in the Predict.scala file? I ask because I am seeing issues with the feature data type after the transformation.
For example, when I read in the trained model and try to predict on the new featurized data, I get this error:
type mismatch;
[error] found : org.apache.spark.sql.DataFrame
[error] (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
[error] required: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] => org.apache.spark.sql.Dataset[?]
[error] val predictions = model.transform(featureData)
Training code: Train.scala
// imports (the LightGBM import path depends on your MMLSpark/SynapseML version)
import org.apache.spark.ml.feature.VectorAssembler
import com.microsoft.ml.spark.LightGBMClassifier

// assembler
val assembler = new VectorAssembler()
.setInputCols(feature_list)
.setOutputCol("features")
//read in train data
val trainingData = spark
.read
.parquet(train_data_path)
// generate training features
val trainingFeatures = assembler.transform(trainingData)
//define model
val lightGBMClassifier = new LightGBMClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setIsUnbalance(true)
.setMaxDepth(25)
.setNumLeaves(31)
.setNumIterations(100)
// fit model
val lgbm = lightGBMClassifier.fit(trainingFeatures)
//save model
lgbm
.write
.overwrite()
.save(my_model_s3_path)
Predict code: Predict.scala
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
.setInputCols(feature_list)
.setOutputCol("features")
// load model
val model = spark.read.parquet(my_model_s3_path)
// load new data
val inputData = spark.read.parquet(new_data_path)
//Assembler to transform new data
val featureData = assembler.transform(inputData)
//predict on new data
val predictions = model.transform(featureData) // <- got error here
Should I be using a different method to read in my trained model or to transform my data?
"Should I use the same VectorAssembler in the Train.scala for the Predict.scala file?" Yes; however, I would strongly recommend using Pipelines.
// Train.scala
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(assembler, lightGBMClassifier))
val pipelineModel = pipeline.fit(trainingData)
pipelineModel.write.overwrite().save("/path/to/pipelineModel")
// Predict.scala
import org.apache.spark.ml.PipelineModel
val pipelineModel = PipelineModel.load("/path/to/pipelineModel")
val predictions = pipelineModel.transform(inputData)
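As for why the original code fails to compile: spark.read.parquet returns a DataFrame, and a DataFrame's transform method expects a function Dataset[Row] => Dataset[?], which is exactly the type mismatch the compiler reports. If you prefer to keep the two-file structure without a Pipeline, the saved model should be reloaded through the model class itself rather than as parquet. A sketch only; the import path and loader depend on your MMLSpark/SynapseML version, so verify them against your library's docs:

```scala
// Predict.scala (sketch; check the import path for your MMLSpark/SynapseML version)
import com.microsoft.ml.spark.LightGBMClassificationModel

// Load the fitted model object, not the raw parquet files that back it
val model = LightGBMClassificationModel.load(my_model_s3_path)

// model.transform now accepts a DataFrame, as expected
val predictions = model.transform(featureData)
```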
See if the issue goes away by simply using Pipelines, serializing/deserializing the model correctly, and structuring your code better. Also, make sure that trainingData and inputData both contain the same columns listed in feature_list.
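One quick way to verify that is to diff the expected feature columns against each DataFrame's schema (df.columns) before calling the assembler. A minimal sketch; the column names below are made up for illustration:

```scala
// Illustrative sanity check: return any expected feature columns
// that are missing from a DataFrame's schema.
def missingFeatures(featureList: Seq[String], dfColumns: Seq[String]): Seq[String] =
  featureList.filterNot(dfColumns.contains)

// Hypothetical column sets, e.g. from trainingData.columns / inputData.columns
val featureList = Seq("age", "income", "score")
val trainCols   = Seq("age", "income", "score", "label")
val inputCols   = Seq("age", "income")

missingFeatures(featureList, trainCols) // empty: safe to assemble
missingFeatures(featureList, inputCols) // Seq("score"): the assembler would fail
```

Running this check in both Train.scala and Predict.scala catches schema drift before it surfaces as an opaque VectorAssembler error.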