How to apply one-hot encoding to unseen future data in sklearn

Question

I am working with the Titanic data as a sample set, and I have come across a use case where I want to do one-hot encoding during the training phase and then apply my model. After this is done, I plan to store the model so that I can load it back and score an unseen dataset. The plan is to have two .py files. One is train.py, which will load the data, do feature engineering, fit a logistic regression model, and then save the model to disk. The second file is score.py. In score.py, I want to take an entire unseen dataset, load the model from disk, and then score that data to generate predictions. The problem is that in score.py I will have to transform the raw unseen data into one-hot encoded columns before generating predictions.

Here is some code for train.py:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


# Load the raw training data (path elided, as in score.py)
data = pd.read_csv(..)
data_set = data[['Pclass','Sex','Age','Fare','SibSp','Cabin']]
one_hot_encoded_training_predictors = pd.get_dummies(data_set)
one_hot_encoded_training_predictors.head()
X = one_hot_encoded_training_predictors
y = data['Survived']

#Train Test split---75 25 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
logreg = LogisticRegression() 
logreg.fit(X_train, y_train)


# Predict on the held-out test set
y_pred = logreg.predict(X_test)

# Save model code here (placeholder: LogisticRegression has no save method)
logreg.save(..)

My score.py would look like:

import pandas as pd  # remaining import statements omitted
unseen_data = pd.read_csv(..) # this is raw unseen data

model.load(..)
model.predict(unseen_data)

Now imagine I have an unseen set that the model has never seen. I can load the trained model using logreg.load(..), but the problem I am facing is: how do I first perform the one-hot encoding on my raw unseen features? Can I also save the one-hot encoding objects to be reused on the unseen set? I am new to machine learning and I might be missing something very simple, but that is the issue I need to resolve.

Answer 1

If you use OneHotEncoder, you can handle unknown categories by setting the handle_unknown parameter to "ignore". When this parameter is set to 'ignore' and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
...
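
To make the train.py / score.py split concrete, here is a minimal sketch of one way to do it, not necessarily the only or best one. It assumes the encoder and the model are wrapped in a single Pipeline and persisted with joblib, so the fitted encoder travels with the model. The toy DataFrame, the column subset, the "unknown" category value, and the file name titanic_pipeline.joblib are all made up for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the Titanic training data (made-up values, an assumption).
train = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Fare": [71.3, 7.9, 13.0, 8.1, 53.1, 26.0],
    "Survived": [0, 1, 1, 0, 1, 0],
})
categorical = ["Sex"]
numeric = ["Pclass", "Fare"]
features = categorical + numeric

# --- train.py side: the encoder is fitted together with the model ---
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # numeric columns are passed through unchanged
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("logreg", LogisticRegression()),
])
pipeline.fit(train[features], train["Survived"])

# Persist the whole pipeline (hypothetical file name, temporary directory).
model_path = os.path.join(tempfile.mkdtemp(), "titanic_pipeline.joblib")
joblib.dump(pipeline, model_path)

# --- score.py side: load the pipeline and score raw, unencoded data ---
unseen = pd.DataFrame({
    "Pclass": [2, 1],
    "Sex": ["female", "unknown"],  # "unknown" never appeared in training
    "Fare": [15.0, 80.0],
})
loaded = joblib.load(model_path)
predictions = loaded.predict(unseen[features])

# Inspect how the fitted encoder treats the unseen category:
# the row for "unknown" is encoded as all zeros.
encoder = loaded.named_steps["preprocess"].named_transformers_["onehot"]
encoded_sex = encoder.transform(unseen[categorical]).toarray()
print(predictions)
print(encoded_sex)
```

Because the encoder lives inside the saved pipeline, score.py never calls pd.get_dummies; the raw columns go straight into predict, and the column layout always matches what the model was trained on.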

Solution 1
0 2019-09-25 16:12:30