
How exactly does sklearn's OneHotEncoder work?

I am trying to use sklearn's OneHotEncoder on a subset of the Titanic dataset (a pandas DataFrame).

The documentation reads:

"By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually."

It also states that I don't need to specify the categories, since this is done automatically:

'auto': Determine categories automatically from the training data. (default)

So using this I write:

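from sklearn.preprocessing import OneHotEncoder

# x_train is a pandas DataFrame holding a subset of the Titanic data
print(x_train.head())
enc = OneHotEncoder(handle_unknown="ignore")
print("_____________")
print(x_train.shape)
x_train = enc.fit_transform(x_train)  # returns a sparse matrix
print(x_train.shape)
print(x_train.toarray())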

and get the output:

     Pclass   Sex        Age  SibSp  Parch     Fare Cabin Embarked
845       3  male  42.000000      0      0   7.5500  None        S
162       3  male  26.000000      0      0   7.7750  None        S
630       1  male  80.000000      0      0  30.0000   A23        S
176       3  male  29.699118      3      1  25.4667  None        S
115       3  male  21.000000      0      0   7.9250  None        S
_____________
(712, 8)
(712, 460)
[[0. 0. 1. ... 0. 0. 1.]
 [0. 0. 1. ... 0. 0. 1.]
 [1. 0. 0. ... 0. 0. 1.]
 ...
 [0. 0. 1. ... 0. 0. 1.]
 [1. 0. 0. ... 0. 0. 1.]
 [0. 0. 1. ... 0. 0. 1.]]

So I can see that more features have been added (as expected), but which categories are actually being encoded? All of them? If so, Age has a finite number of "categories" but is clearly not a categorical variable. Is this not a problem? If the encoder uses the pandas DataFrame column dtype to decide whether to one-hot encode, then what happens to "Pclass", which has type int but is clearly categorical?

One-hot encoding should be used for categorical features only. If you have 3 categorical features that can take 1, 2 and 3 different values respectively, you will end up with 1+2+3 = 6 new features. For example, if your feature is fruit and can take the values apple, pineapple and pear, then after one-hot encoding you have 3 new features (apple, pineapple and pear), each of which takes the value 1 or 0.
Age is not a categorical feature, so you should not one-hot encode it.
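
The fruit example can be sketched as follows. This is only an illustration: the `fruits` array is made-up data, not part of the Titanic set.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical feature with three possible values.
fruits = np.array([["apple"], ["pineapple"], ["pear"], ["apple"]])

enc = OneHotEncoder()
encoded = enc.fit_transform(fruits).toarray()

# The learned categories are sorted, so the columns are apple, pear, pineapple.
print(enc.categories_)

# Each row contains exactly one 1, in the column of its fruit.
print(encoded)
```

One input column with three distinct values becomes three 0/1 columns.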

You have to define your categorical variables yourself and then apply the OneHotEncoder to those features only.

If you don't define the categorical features, OneHotEncoder will encode every feature it is given (categorical or not); it does not look at the column dtype to decide.
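
A small sketch of this behaviour, using a toy DataFrame with made-up values loosely modelled on the Titanic columns (the data here is an assumption, not the asker's `x_train`). Inspecting the fitted `categories_` attribute shows that every column, numeric ones included, gets its own set of categories:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with made-up values; not the real Titanic data.
df = pd.DataFrame({
    "Pclass": [3, 3, 1, 3],
    "Sex": ["male", "female", "male", "male"],
    "Age": [42.0, 26.0, 80.0, 21.0],
})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(df)

# One array of learned categories per input column, numeric columns included.
for name, cats in zip(df.columns, enc.categories_):
    print(name, list(cats))

# 2 (Pclass) + 2 (Sex) + 4 (Age) = 8 output columns.
print(encoded.shape)
```

Every distinct Age value becomes its own column here, which is presumably also why the 8 columns in the question grew to 460: continuous columns such as Age and Fare contribute one output column per unique value.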

So I strongly recommend that you first decide which features are categorical and apply the OneHotEncoder only to them.
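
One possible way to do this is scikit-learn's `ColumnTransformer`, which applies a transformer to a chosen list of columns and can pass the rest through unchanged. The sketch below uses a toy DataFrame with made-up values, and the choice of `categorical_cols` is an assumption for illustration; adapt it to your own data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with made-up values; not the real Titanic data.
df = pd.DataFrame({
    "Pclass": [3, 3, 1, 3],
    "Sex": ["male", "female", "male", "male"],
    "Age": [42.0, 26.0, 80.0, 21.0],
    "Embarked": ["S", "C", "S", "Q"],
})

# Assumed choice: Pclass is stored as int but treated as categorical.
categorical_cols = ["Pclass", "Sex", "Embarked"]

ct = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",  # leave Age (and any other column) untouched
    sparse_threshold=0,       # always return a dense array
)

encoded = ct.fit_transform(df)

# 2 (Pclass) + 2 (Sex) + 3 (Embarked) + 1 passthrough (Age) = 8 columns.
print(encoded.shape)
# The passthrough column is appended last and keeps its original values.
print(encoded[:, -1])
```

Age stays a single numeric column, while only the listed columns are expanded into 0/1 indicators.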
