How to use OneHotEncoder and LabelEncoder to preprocess unseen data so it matches the training set?

Question

I wrote a classifier and preprocessed my categorical data with scikit-learn's LabelEncoder (LE) and OneHotEncoder (OHE). It works great on the train and test data. Now I want to make predictions on new data. My question: how do I convert the new data with LE and OHE in the same style (for lack of a better word) as the training data? My code so far:

labelencoder_X = LabelEncoder()

X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])
onehotencoder = OneHotEncoder(categorical_features='all')  # to encode a single column, pass categorical_features=[0],
# where [0] is the column index; to encode all columns, pass 'all'
X = onehotencoder.fit_transform(X).toarray()

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

What I tried without success:

new_pred = np.array(['car','male'])
labelencoder_new_pred = LabelEncoder()
new_pred = labelencoder_new_pred.fit_transform(new_pred) #also tried new_pred = labelencoder_X.fit_transform(new_pred) 
onehotencoder2 = OneHotEncoder(categorical_features='all',n_values=29)

new_pred = onehotencoder2.fit_transform(new_pred).toarray()#also tried new_pred = onehotencoder.fit_transform(new_pred).toarray()

z = cfl.predict(new_pred)

The results of this:

The prediction is always the same, even when I replace new_pred with a sample taken from the training set.
It produces a one-hot encoding that differs from the one produced on the training set.

What am I missing here? Thanks!

Answer 1

You'll have to store (i.e. pickle) your fitted LabelEncoders and OneHotEncoder. See the scikit-learn documentation on model persistence.

When you receive new data, you transform it with the already-fitted LabelEncoders and OneHotEncoder and then use your trained model to make the predictions. This way, the transformed data will be in the exact format your model expects.
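A minimal sketch of that idea, assuming a single categorical column and the standard-library pickle module (the sample values and variable names here are hypothetical, not taken from the question's data). The encoder is fitted once, serialized, and later restored to transform new values without refitting:

```python
import pickle

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical training column; the real values come from the asker's X.
train_col = np.array(["car", "bike", "car", "truck"])

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_col)  # learn the mapping once

# Persist the fitted encoder. pickle.dump to a file works the same way.
blob = pickle.dumps(encoder)

# Later, at prediction time: restore and call transform only, never fit again.
restored = pickle.loads(blob)
new_encoded = restored.transform(np.array(["truck", "car"]))
print(new_encoded)
```

The same pattern applies to the fitted OneHotEncoder and to the trained classifier. Note that transform raises a ValueError if the new data contains a label the encoder never saw during fitting.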

Answer 2

You were using the same LabelEncoder instance to encode two different columns, so the second fit overwrote the mapping learned for the first. Try something like the following:

labelencoder_X_1 = LabelEncoder()
labelencoder_X_2 = LabelEncoder()

X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])

Now you can use

# transform (not fit_transform) reuses the mapping learned on the training data
new_data[:, 1] = labelencoder_X_2.transform(new_data[:, 1])

where new_data is the sample data that you want to preprocess for prediction.

Similarly, you can use the same approach for the one-hot encoding: fit the OneHotEncoder on the training data and only call transform on new data.
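Putting the pieces together, here is a small end-to-end sketch under some assumptions: the two-column sample data and the label_encode helper are hypothetical, and the OneHotEncoder is created with default arguments rather than the categorical_features and n_values options used in the question.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical training data with two categorical columns.
X_train = np.array([["car", "male"], ["bike", "female"], ["truck", "male"]])
new_data = np.array([["car", "male"]])

# One LabelEncoder per column, fitted on the training data only.
encoders = [LabelEncoder().fit(X_train[:, j]) for j in range(X_train.shape[1])]


def label_encode(X):
    """Apply the already-fitted encoders column by column (hypothetical helper)."""
    return np.column_stack([enc.transform(X[:, j]) for j, enc in enumerate(encoders)])


# Fit the one-hot encoder on the training data.
onehotencoder = OneHotEncoder()
X_train_ohe = onehotencoder.fit_transform(label_encode(X_train)).toarray()

# New data: transform only, so the column layout matches the training set.
new_ohe = onehotencoder.transform(label_encode(new_data)).toarray()
print(new_ohe)
```

Because nothing is refitted, the new row gets exactly the same encoding as the identical training row. On scikit-learn 0.20 and later, OneHotEncoder accepts string columns directly, so the LabelEncoder step for the features can be skipped there; the categorical_features and n_values parameters were deprecated in 0.20 and removed in 0.22.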

Answer 1
2 2017-08-07 11:12:27

Answer 2
0 2017-10-31 05:09:57
