
How to save a one-hot encoded model and predict new unencoded data using scikit-learn?

My dataset contains 3 categorical features, and I used one-hot encoding to convert them to binary format; that all went fine. But when I save the trained model and try to predict on new raw data, the input is not encoded as I expected, which results in an error.

combined_df_raw2 = pd.concat([train_x_raw, unknown_test_df])
combined_df2 = pd.get_dummies(combined_df_raw2, columns=nominal_cols,
                              drop_first=True)

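encoded_train_df = combined_df2[:len(train_x_raw)]
encoded_unknown_df = combined_df2[len(train_x_raw):]

classifier = DecisionTreeClassifier(random_state=17)
# Fit on the encoded training rows (the raw frame still contains strings)
classifier.fit(encoded_train_df, train_Y)

pred_y = classifier.predict(encoded_unknown_df)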

#here I use joblib to save my model and load it again
joblib.dump(classifier, 'savedmodel')
imported_model = joblib.load('savedmodel')

# Here I pass unencoded raw data to predict and get an error that 'tcp'
# cannot be converted to float, which means the input is not encoded

imported_model.predict([0,'tcp','vmnet','REJ',0,0,0,23])   

ValueError: could not convert string to float: 'tcp'

The model was trained after the categorical variables were encoded, so any input has to be given after applying the same one-hot encoding to the respective variables. Example: one of the columns is titled "Country" and takes three different values across the dataset, namely ['India', 'Israel', 'France']. You apply OneHotEncoder to the Country column (probably after LabelEncoder), then train your model, save it, and do whatever else you want.

The problem is that you get an input error when you pass input directly, without first converting it to the format the model was trained on. Hence, we always want to preprocess the input before giving it to the model. To my knowledge, the most common way to do this is to use Pipeline.

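from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Note: each step is applied to every column of X_train, so string columns
# must be handled before StandardScaler (e.g. with a ColumnTransformer)
steps = [('scaler', StandardScaler()), ('ohe', OneHotEncoder()),
         ('tree', DecisionTreeClassifier())]
pipeline = Pipeline(steps)  # You need to save this pipeline via joblib
pipeline.fit(X_train, y_train)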
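
Because every Pipeline step is applied to all columns, one option for mixed numeric and string features is to wrap the encoder in a ColumnTransformer so that only the categorical columns are one-hot encoded. The following is a minimal sketch under stated assumptions, not the answerer's exact code: the toy data, the column names (`duration`, `protocol`, `flag`), the labels, and the file name `model.joblib` are all made up for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the question's dataset.
X_train = pd.DataFrame({
    "duration": [0, 5, 0, 12],
    "protocol": ["tcp", "udp", "tcp", "icmp"],
    "flag": ["REJ", "SF", "REJ", "SF"],
})
y_train = ["attack", "normal", "attack", "normal"]
nominal_cols = ["protocol", "flag"]

# One-hot encode only the categorical columns; pass numeric ones through.
preprocess = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), nominal_cols)],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=17)),
])
pipeline.fit(X_train, y_train)

# Persist the whole pipeline, so the fitted encoder travels with the model.
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipeline, model_path)
imported_model = joblib.load(model_path)

# Raw, unencoded input with the same column names as the training frame.
new_row = pd.DataFrame([{"duration": 0, "protocol": "tcp", "flag": "REJ"}])
prediction = imported_model.predict(new_row)
print(prediction)
```

Since the encoder is stored inside the saved pipeline, the loaded model accepts raw strings such as 'tcp' directly, and no separate encoding step is needed at prediction time.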

In case you don't want to use Pipeline, you can still apply OneHotEncoder to the specific column(s) yourself and then call predict.

Use fit() followed by transform(); that way you can pickle your one-hot encoder after you have fitted it.

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)

Then let's pickle your encoder; you could also use other ways of persisting it. See https://scikit-learn.org/stable/modules/model_persistence.html

import pickle
with open('encoder.pickle', 'wb') as f:
    pickle.dump(enc, f)

Now, when you have new data to predict, you must first run it through the entire preprocessing pipeline you used for your training data. In this case, that is the encoder. Let's load it back:

with open('encoder.pickle', 'rb') as f:
    enc = pickle.load(f)  # load() reads from a file object; loads() expects bytes

Once you have it loaded, you just need to transform the new data.

enc.transform(new_data)
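
Put together, the steps above can be sketched end to end as follows. The temporary file location and the `new_data` rows are assumptions for illustration; the second row holds a category ('Other') that was never seen during fit, to show what handle_unknown='ignore' does with it.

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the training data from the answer.
enc = OneHotEncoder(handle_unknown="ignore")
X = [["Male", 1], ["Female", 3], ["Female", 2]]
enc.fit(X)

# Persist the fitted encoder (a temporary directory is used here).
encoder_path = os.path.join(tempfile.mkdtemp(), "encoder.pickle")
with open(encoder_path, "wb") as f:
    pickle.dump(enc, f)

# Later, e.g. in another process: load it back and encode new raw rows.
with open(encoder_path, "rb") as f:
    loaded_enc = pickle.load(f)

new_data = [["Female", 1], ["Other", 2]]  # hypothetical new rows
encoded = loaded_enc.transform(new_data).toarray()

# The column count is fixed at fit time: 2 gender + 3 numeric categories.
print(encoded.shape)
print(encoded)
```

The unseen category does not raise an error; its block of columns is simply all zeros, so the output always has the width the model was trained on.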

To learn more about pickle, see https://docs.python.org/3/library/pickle.html

@chintan Then, for example, for the incoming raw data: if you encode a categorical variable from a single instance, it will produce only one extra column, whereas the categorical column in your training data would have produced something like 500 columns, so the shapes won't match. Take currencies as an example: an incoming instance has only INR. Even if you encode it, it becomes a single column, but the training data had columns for all the currencies in the world.
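
The mismatch described in this comment can be reproduced with pd.get_dummies, which derives its columns from whatever values happen to be present. One possible workaround, which is not mentioned in the thread and is shown here only as a sketch, is to remember the training columns and reindex the new frame against them. The currency values are made up for illustration.

```python
import pandas as pd

# Hypothetical training data with three currencies.
train = pd.DataFrame({"currency": ["INR", "USD", "EUR", "INR"]})
train_encoded = pd.get_dummies(train, columns=["currency"], dtype=int)
train_columns = list(train_encoded.columns)

# A single incoming row only ever produces a single dummy column.
new = pd.DataFrame({"currency": ["INR"]})
new_encoded = pd.get_dummies(new, columns=["currency"], dtype=int)
print(list(new_encoded.columns))

# Align the new row to the training layout, filling absent columns with 0.
aligned = new_encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

A fitted OneHotEncoder avoids this bookkeeping, because it stores the categories it saw during fit and always emits the same columns.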
