How do I preprocess unseen data with OneHotEncoder and LabelEncoder so that it matches the training set?
I wrote a classifier and preprocessed its categorical data with scikit-learn's LabelEncoder (LE) and OneHotEncoder (OHE). It works great on the training and test data. Now I want to make predictions on new data. My question: how do I convert the new data with LE and OHE in the same style (for lack of a better word) as the training data? My code so far:
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])
onehotencoder = OneHotEncoder(categorical_features='all')  # to encode a single column, pass categorical_features=[0],
# where [0] is the column index; to encode every column, pass 'all'
X = onehotencoder.fit_transform(X).toarray()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
What I tried without success:
new_pred = np.array(['car','male'])
labelencoder_new_pred = LabelEncoder()
new_pred = labelencoder_new_pred.fit_transform(new_pred) #also tried new_pred = labelencoder_X.fit_transform(new_pred)
onehotencoder2 = OneHotEncoder(categorical_features='all',n_values=29)
new_pred = onehotencoder2.fit_transform(new_pred).toarray()#also tried new_pred = onehotencoder.fit_transform(new_pred).toarray()
z = cfl.predict(new_pred)
The results of this:
What am I missing here? Thanks!
You'll have to store (i.e. pickle) your fitted LabelEncoders and OneHotEncoder. See: model persistence
When you receive new data, transform it with the already-fitted LabelEncoders and OneHotEncoder (call transform, not fit_transform), and then use your trained model to make the predictions. This way, the resulting data will be in exactly the format your model expects.
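A minimal sketch of that workflow, under some assumptions: the toy values are made up, pickle.dumps/pickle.loads stand in for writing to and reading from a file, and a recent scikit-learn is assumed (the categorical_features and n_values arguments used in the question were removed in version 0.22).

```python
import pickle

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy training data (hypothetical values): two categorical columns.
X_train = np.array(
    [["car", "male"], ["bike", "female"], ["car", "female"]], dtype=object
)

# Fit one LabelEncoder per column, on the training data only.
label_encoders = [LabelEncoder().fit(X_train[:, i]) for i in range(X_train.shape[1])]
X_int = np.column_stack(
    [enc.transform(X_train[:, i]) for i, enc in enumerate(label_encoders)]
)

# Fit the OneHotEncoder on the integer-coded training data.
onehot = OneHotEncoder()
X_encoded = onehot.fit_transform(X_int).toarray()

# Persist the fitted encoders (in practice: pickle.dump to a file).
blob = pickle.dumps({"label_encoders": label_encoders, "onehot": onehot})

# Later, at prediction time: load the encoders and only call transform.
saved = pickle.loads(blob)
new_sample = np.array([["car", "male"]], dtype=object)
new_int = np.column_stack(
    [enc.transform(new_sample[:, i]) for i, enc in enumerate(saved["label_encoders"])]
)
new_encoded = saved["onehot"].transform(new_int).toarray()

# new_encoded has the same number of columns as the training matrix,
# so it can be passed straight to the trained classifier's predict().
print(new_encoded.shape, X_encoded.shape)
```

The trained classifier would normally be pickled alongside the encoders so that all three stay in sync.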
You were using the same LabelEncoder instance to encode two different columns, so the second fit overwrote the mapping learned for the first. Use one encoder per column, something like this:
labelencoder_X_1 = LabelEncoder()
labelencoder_X_2 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])
Now you can use
new_data[:, 1] = labelencoder_X_2.transform(new_data[:, 1])  # transform only: refitting would discard the training mapping
where new_data is the sample data that you want to preprocess for prediction.
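To see why the new data should go through the already-fitted encoder rather than a fresh fit, here is a small sketch with made-up values. It compares the code that a fitted LabelEncoder assigns to "male" with the code that an encoder fitted only on the new data assigns.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical training column and a single new value to encode.
train_col = np.array(["female", "male", "female"])
new_col = np.array(["male"])

# Encoder fitted on the training data: classes are sorted, so female=0, male=1.
fitted = LabelEncoder().fit(train_col)
reused = fitted.transform(new_col)

# Encoder fitted on the new data alone: "male" is its only class, so it gets 0.
refit = LabelEncoder().fit_transform(new_col)

print(reused, refit)
```

The two codes differ, so a model trained with the first mapping would misread data encoded with the second.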
Similarly, you can use the same approach for the one-hot encoding: keep the fitted OneHotEncoder and call transform on the new data.
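The same idea for the one-hot step could look like the sketch below. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly, and it adds handle_unknown="ignore" (not part of the original answer) so that a category never seen in training becomes an all-zero block instead of raising an error. The data values are made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new data; "boat" never appears in training.
X_train = np.array(
    [["car", "male"], ["bike", "female"], ["car", "female"]], dtype=object
)
new_data = np.array([["car", "male"], ["boat", "female"]], dtype=object)

# Fit once on the training data.
onehot = OneHotEncoder(handle_unknown="ignore")
X_encoded = onehot.fit_transform(X_train).toarray()

# Reuse the fitted encoder on new data: transform, never fit_transform.
new_encoded = onehot.transform(new_data).toarray()

# Columns are ordered as [bike, car, female, male].
print(new_encoded)
```

Whether silently zeroing unknown categories is acceptable depends on the model; the default handle_unknown="error" is the safer choice when unseen values should be flagged.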