Deploying a machine learning model with one-hot encoded features

Question

I have trained an xgboost classifier with categorical features that I previously one-hot encoded. For example, I have a categorical feature 'Year' that takes values between 2014 and 2018. After one-hot encoding I get 5 binary features: Year_2014, Year_2015, Year_2016, Year_2017, Year_2018. What happens if I make a prediction on a sample that has Year=2019, given that the feature Year_2019 does not exist?

More generally, what is a robust way to transform data in order to make predictions on new samples?

Answer 1

Binary features are evaluated like this:

if(year != ${year value}){
  // Enter "left" branch
} else {
  // Enter "right" branch
}

An unseen category level never equals the split value, so it gets sent to the "left" branch.
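To make the branching rule concrete, here is a minimal sketch in plain Python. It does not call xgboost; the helper names `one_hot_year` and `branch` are hypothetical and only mimic the rule described in this answer, assuming the five training years from the question.

```python
# Hypothetical helpers that mimic the branching rule; this is not xgboost code.
TRAINING_YEARS = [2014, 2015, 2016, 2017, 2018]

def one_hot_year(year):
    """Build the Year_* indicator columns for one sample."""
    return {f"Year_{y}": int(year == y) for y in TRAINING_YEARS}

def branch(sample, feature):
    """Mimic a split on a binary indicator: 0 goes left, 1 goes right."""
    return "right" if sample[feature] == 1 else "left"

seen = one_hot_year(2018)
unseen = one_hot_year(2019)  # 2019 was never seen during training

print(branch(seen, "Year_2018"))    # right
print(branch(unseen, "Year_2018"))  # left
print(sum(unseen.values()))         # 0, every indicator is zero
```

Under these assumptions, a 2019 sample is indistinguishable from "none of the known years" and goes left at every Year_* split.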

Answer 2

import pandas as pd

# During training, suppose year takes the values below
df = pd.DataFrame([2014, 2015, 2016, 2017, 2018], columns=['year'])
data = pd.get_dummies(df, columns=['year'])
data.head()

# At prediction time, suppose the input for year is 2018
# (the categories use the same integer type as the training data)
known_categories = [2014, 2015, 2016, 2017, 2018]
year_type = pd.Series([2018])
year_type = pd.Categorical(year_type, categories=known_categories)
pd.get_dummies(year_type)
# The column names do not matter here; only the values, in the same
# column order as in training, are fed to the model
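The same approach also covers the unseen value from the question. The sketch below assumes the known categories are stored at training time; `encode_year` is a hypothetical helper name, not part of pandas. A value outside the known categories becomes missing in the `Categorical`, so its row comes out as all zeros and the column set stays fixed.

```python
import pandas as pd

# Assumption: this list is saved at training time and reused at prediction time.
known_categories = [2014, 2015, 2016, 2017, 2018]

def encode_year(values):
    """One-hot encode years against the fixed training categories (hypothetical helper)."""
    cat = pd.Categorical(values, categories=known_categories)
    # Values outside known_categories become NaN and produce an all-zero row.
    return pd.get_dummies(pd.Series(cat), prefix="year", dtype=int)

encoded = encode_year([2018, 2019])
print(encoded)
```

The first row has a 1 only in year_2018, while the 2019 row is all zeros, so the model always receives the same five columns regardless of the input.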

Answer 1
0 2019-03-07 20:20:57

Answer 2
0 2021-02-18 06:17:36
