
How do I preprocess unseen data with OneHotEncoder and LabelEncoder so that it matches the training set?

I wrote a classifier and preprocessed the data (it is categorical) with scikit-learn's LabelEncoder (LE) and OneHotEncoder (OHE), and it works well on the training and test data. Now I want to make predictions on new data. My question: how do I convert the new data with LE and OHE in the same way (for lack of a better word) as the training data? My code so far:

labelencoder_X = LabelEncoder()

X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
X[:, 1] = labelencoder_X.fit_transform(X[:, 1])
onehotencoder = OneHotEncoder(categorical_features='all')  # to encode a single column, use categorical_features=[0],
# where [0] is the column index; to encode all columns, use 'all'
X = onehotencoder.fit_transform(X).toarray()

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

What I tried without success:

new_pred = np.array(['car','male'])
labelencoder_new_pred = LabelEncoder()
new_pred = labelencoder_new_pred.fit_transform(new_pred) #also tried new_pred = labelencoder_X.fit_transform(new_pred) 
onehotencoder2 = OneHotEncoder(categorical_features='all',n_values=29)

new_pred = onehotencoder2.fit_transform(new_pred).toarray()#also tried new_pred = onehotencoder.fit_transform(new_pred).toarray()

z = cfl.predict(new_pred)

The results of this:

  1. The prediction is always the same, even when I replace new_pred with a row that appears in the training set.
  2. It produces a one-hot encoding different from the one produced on the training set.

What am I missing here? Thanks!

You'll have to store (i.e. pickle) your fitted LabelEncoders and OneHotEncoder. See the scikit-learn documentation on model persistence.

When you receive new data, transform it with the already-fitted LabelEncoders and OneHotEncoder, and then use your trained model to make the predictions. This way, the resulting data will be in exactly the format your model expects.
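As a minimal sketch of that workflow (the toy data and the variable names `le_vehicle`, `le_gender`, and `blob` are made up for illustration, and this is not the asker's actual dataset): fit one encoder per column on the training data, pickle the fitted objects, and at prediction time only call `transform` on the restored objects. `pickle.dumps`/`pickle.loads` are used here to keep the example self-contained; writing to a file with `pickle.dump` works the same way.

```python
import pickle

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical training data: column 0 = vehicle, column 1 = gender.
X_train = np.array(
    [["car", "male"], ["bike", "female"], ["bus", "male"]], dtype=object
)

# One LabelEncoder per column, each fitted once on the training data.
le_vehicle = LabelEncoder().fit(X_train[:, 0])
le_gender = LabelEncoder().fit(X_train[:, 1])

train_codes = np.column_stack([
    le_vehicle.transform(X_train[:, 0]),
    le_gender.transform(X_train[:, 1]),
])

# Fit the one-hot encoder on the integer codes of the training data.
ohe = OneHotEncoder().fit(train_codes)

# Persist the fitted encoders (in-memory here; a file works the same way).
blob = pickle.dumps({"le_vehicle": le_vehicle, "le_gender": le_gender, "ohe": ohe})

# --- later, at prediction time ---
enc = pickle.loads(blob)
new_row = np.array([["car", "male"]], dtype=object)

# Only transform: no fitting happens on the new data.
new_codes = np.column_stack([
    enc["le_vehicle"].transform(new_row[:, 0]),
    enc["le_gender"].transform(new_row[:, 1]),
])
new_encoded = enc["ohe"].transform(new_codes).toarray()

# The new row gets the same encoding as the identical training row.
train_encoded = ohe.transform(train_codes).toarray()
```

The array `new_encoded` has the same number of columns as `train_encoded`, so it can be passed straight to the trained classifier's `predict`.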

You were using the same LabelEncoder instance to encode two different columns, so the second fit_transform call overwrote the mapping learned for the first column. Try something like the following:

labelencoder_X_1 = LabelEncoder()
labelencoder_X_2 = LabelEncoder()

X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])

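Now you can apply the fitted encoder to new data (call transform, not fit_transform, so that the mapping learned from the training data is reused):

new_data[:, 1] = labelencoder_X_2.transform(new_data[:, 1])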

where new_data is the sample data that you want to preprocess for prediction.
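To see why refitting on the new data goes wrong, here is a small illustration with made-up values (the arrays `train_col` and `new_col` are hypothetical). LabelEncoder assigns codes according to the sorted set of labels it was fitted on, so a fresh fit on a single new value always yields code 0, whatever that value meant in the training set.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

train_col = np.array(["bike", "bus", "car", "car"])  # hypothetical training column
new_col = np.array(["car"])                          # hypothetical new sample

# Encoder fitted on the training column: bike -> 0, bus -> 1, car -> 2.
fitted = LabelEncoder().fit(train_col)

# Reusing the fitted encoder keeps the training mapping.
reused = fitted.transform(new_col)

# Refitting on the new data builds a new mapping from scratch.
refit = LabelEncoder().fit_transform(new_col)

print(reused)  # [2]
print(refit)   # [0]

# A label that was never seen during fitting is rejected with a ValueError.
try:
    fitted.transform(np.array(["truck"]))
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
```

This also explains the constant prediction in the question: a freshly fitted encoder maps every single-row input to the same codes.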

Similarly, you can use the same approach for the one-hot encoding: fit the OneHotEncoder once on the training data and only call transform on new data.
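A hedged sketch of the one-hot step, under the assumption that you are on scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly (so the LabelEncoder step for the features becomes optional). This is an alternative to the code in the question, not a reproduction of it, and the data is made up. Passing `handle_unknown="ignore"` is an optional choice that encodes a category never seen during fitting as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: column 0 = vehicle, column 1 = gender.
X_train = np.array(
    [["car", "male"], ["bike", "female"], ["bus", "male"]], dtype=object
)

# Fit once on the training data.
ohe = OneHotEncoder(handle_unknown="ignore")
X_train_enc = ohe.fit_transform(X_train).toarray()

# New data: the second row contains a vehicle not present in training.
X_new = np.array([["car", "male"], ["truck", "female"]], dtype=object)

# transform (not fit_transform) keeps the training column layout.
X_new_enc = ohe.transform(X_new).toarray()

print(X_train_enc.shape)  # (3, 5)
print(X_new_enc.shape)    # (2, 5)
```

Because the encoder is never refitted, `X_new_enc` always has the same columns in the same order as `X_train_enc`.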

