
How exactly does sklearn's OneHotEncoder work?

I am trying to use sklearn's OneHotEncoder on a subset of the Titanic dataset (a pandas DataFrame).

The documentation reads

"By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually."

and also states that I don't need to specify the categories since it is done automatically

'auto': Determine categories automatically from the training data. (default)

So using this I write:

from sklearn.preprocessing import OneHotEncoder

print(x_train.head())
enc = OneHotEncoder(handle_unknown="ignore")
print("_____________")
print(x_train.shape)
x_train = enc.fit_transform(x_train)  # returns a sparse matrix
print(x_train.shape)
print(x_train.toarray())

and get output

     Pclass   Sex        Age  SibSp  Parch     Fare Cabin Embarked
845       3  male  42.000000      0      0   7.5500  None        S
162       3  male  26.000000      0      0   7.7750  None        S
630       1  male  80.000000      0      0  30.0000   A23        S
176       3  male  29.699118      3      1  25.4667  None        S
115       3  male  21.000000      0      0   7.9250  None        S
_____________
(712, 8)
(712, 460)
[[0. 0. 1. ... 0. 0. 1.]
 [0. 0. 1. ... 0. 0. 1.]
 [1. 0. 0. ... 0. 0. 1.]
 ...
 [0. 0. 1. ... 0. 0. 1.]
 [1. 0. 0. ... 0. 0. 1.]
 [0. 0. 1. ... 0. 0. 1.]]

So I can see that more features have been added (as expected), but which categories are actually being encoded? All of them? If so, Age has a finite number of distinct values but is clearly not a categorical variable. Isn't that a problem? And if the encoder uses the pandas DataFrame column dtype to decide whether to one-hot encode, what happens to "Pclass", which has type int but is clearly categorical?

One-hot encoding should be used for categorical features only. If you have three categorical features that can take 1, 2 and 3 distinct values respectively, you end up with 1 + 2 + 3 = 6 new features. For example, if your feature is a fruit that can take the values apple, pineapple and pear, then after one-hot encoding you have 3 new features (apple, pineapple and pear), each of which is either 1 or 0.
Age is not a categorical feature, so you should not one-hot encode it.
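
As a small illustration of the fruit example above, here is a minimal sketch (the `fruits` array is made-up data, not taken from the question). It fits the encoder on one categorical column and prints the categories it learned and the resulting 0/1 columns:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three possible values (made-up data).
fruits = np.array([["apple"], ["pineapple"], ["pear"], ["apple"]])

enc = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense for printing.
encoded = enc.fit_transform(fruits).toarray()

# categories_ holds one array of learned categories per input column.
categories = enc.categories_[0].tolist()
print(categories)
print(encoded)
```

Each row has exactly one 1, in the column that corresponds to its fruit.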

You have to decide which of your variables are categorical, and then apply the OneHotEncoder to those features only.

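If you don't restrict it to the categorical features, OneHotEncoder encodes every column it is given, categorical or not. It does not look at the column dtype, which is why your 8 columns turned into 460: every distinct Age, Fare and Cabin value became its own column.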
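
You can check this yourself by inspecting `categories_` after fitting. The sketch below uses a small made-up frame (the column names mimic the Titanic data, the values are hypothetical) and shows that a float column is treated exactly like a string column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame, loosely modeled on the Titanic columns.
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [42.0, 26.0, 80.0, 21.0],
})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(df)

# One entry per input column: every distinct value became a category,
# including each distinct age.
n_categories = [len(c) for c in enc.categories_]
print(n_categories)
print(encoded.shape)
```

The number of output columns is the sum of the per-column category counts, so a numeric column with many distinct values inflates the result quickly.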

So I strongly recommend that you first identify the categorical features and then apply the OneHotEncoder only to them.
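
One way to do this, sketched here under some assumptions, is scikit-learn's `ColumnTransformer`. The frame and the `categorical_cols` list below are illustrative choices, not something taken from your code; note that "Pclass" is listed explicitly even though it is an int column:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with Titanic-like columns.
df = pd.DataFrame({
    "Pclass": [3, 3, 1, 3],
    "Sex": ["male", "female", "male", "male"],
    "Age": [42.0, 26.0, 80.0, 21.0],
    "Embarked": ["S", "S", "C", "Q"],
})

# You choose the categorical columns yourself; the dtype plays no role.
categorical_cols = ["Pclass", "Sex", "Embarked"]

preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",  # keep Age unchanged, appended after the encoded columns
    sparse_threshold=0,       # always return a dense array
)

encoded = preprocess.fit_transform(df)
print(encoded.shape)
print(encoded[:, -1])  # the untouched Age column
```

Here the three categorical columns expand to 2 + 2 + 3 = 7 indicator columns, and Age is passed through as a single numeric column.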
