How to apply one hot encoding on unseen future data in sklearn
I am working with the Titanic data as a sample set, and I have come across a use case where I want to do one-hot encoding during the training phase and then apply my model. After this is done, I plan to store the model so that I can load it back and score an unseen dataset. The plan is to have two .py files. The first is train.py, which will load the data, do feature engineering, fit a logistic regression model, and then save the model to disk. The second is score.py. In score.py, I want to take an entire unseen dataset, load the model from disk, and then score that data to generate predictions. The problem is that in score.py I will have to transform the raw unseen data into one-hot encoded columns before generating predictions.
Here is some code for train.py:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
data_set = data[['Pclass','Sex','Age','Fare','SibSp','Cabin']]
one_hot_encoded_training_predictors = pd.get_dummies(data_set)
one_hot_encoded_training_predictors.head()
X = one_hot_encoded_training_predictors
y = data['Survived']
#Train Test split---75 25 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
##predicting test accuracy
y_pred = logreg.predict(X_test) #predicting the values
# Save model code here
logreg.save(..)
My score.py would look like this:
import statements
unseen_data = pd.read_csv(..) # this is raw unseen data
model.load(..)
model.predict(unseen_data)
Now imagine I have a dataset that the model has never seen. I can load the trained model using logreg.load(..), but the problem I am facing is: how do I first perform the one-hot encoding on my raw unseen features? Can I also save the one-hot encoding objects so they can be reused on the unseen set? I am new to machine learning and I might be missing something very simple, but this is the issue I need to resolve.
If you use OneHotEncoder, you can handle unknown categories by setting the handle_unknown parameter to "ignore". When this parameter is set to "ignore" and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None. The encoder has to be fitted on the training data and reused, not refitted, when scoring, so that the encoded columns line up with what the model was trained on.
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
...
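One way to carry the fitted encoder from train.py to score.py is to wrap it and the classifier in a single scikit-learn Pipeline and persist that one object with joblib. The sketch below is only an illustration under several assumptions: the tiny DataFrames are made-up stand-ins for the Titanic data, the file name titanic_model.joblib and the temp-directory location are hypothetical, and there are no missing values (the real Age and Cabin columns contain gaps that would need an imputation step first).

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["Sex", "Cabin"]                 # one-hot encoded
numeric = ["Pclass", "Age", "Fare", "SibSp"]   # passed through unchanged
features = categorical + numeric

# Hypothetical location for the saved model.
MODEL_PATH = Path(tempfile.gettempdir()) / "titanic_model.joblib"

# ---- train.py ----
# Toy stand-in for the training data (values are made up).
train = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Fare": [71.3, 7.9, 13.0, 8.1, 51.9, 21.1],
    "SibSp": [1, 0, 0, 0, 1, 3],
    "Cabin": ["C85", "E46", "C85", "E46", "C85", "E46"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Fitting the pipeline fits the encoder on the training categories only.
model.fit(train[features], train["Survived"])

# The encoder and the classifier are saved together as one object.
joblib.dump(model, MODEL_PATH)

# ---- score.py ----
loaded = joblib.load(MODEL_PATH)

# Raw unseen data; "B42" is a Cabin value the encoder never saw in training,
# so its Cabin columns are encoded as all zeros instead of raising an error.
unseen = pd.DataFrame({
    "Pclass": [2, 1],
    "Sex": ["female", "male"],
    "Age": [30.0, 45.0],
    "Fare": [15.0, 60.0],
    "SibSp": [0, 1],
    "Cabin": ["C85", "B42"],
})

# The loaded pipeline applies the saved encoding before predicting.
predictions = loaded.predict(unseen[features])
print(predictions)
```

Because the encoding lives inside the saved pipeline, score.py only has to pass in the raw columns. If you would rather keep the encoder and the model separate, the same joblib.dump and joblib.load calls work on a fitted OneHotEncoder by itself.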