I am trying to use sklearn's OneHotEncoder on a subset of the Titanic dataset (a pandas DataFrame).
The documentation reads
"By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually."
and also states that I don't need to specify the categories since it is done automatically
'auto': Determine categories automatically from the training data. (default)
So using this I write:
from sklearn.preprocessing import OneHotEncoder

print(x_train.head())
enc = OneHotEncoder(handle_unknown="ignore")
print("_____________")
print(x_train.shape)
x_train = enc.fit_transform(x_train)
print(x_train.shape)
print(x_train.toarray())
and get output
     Pclass   Sex        Age  SibSp  Parch     Fare Cabin Embarked
845       3  male  42.000000      0      0   7.5500  None        S
162       3  male  26.000000      0      0   7.7750  None        S
630       1  male  80.000000      0      0  30.0000   A23        S
176       3  male  29.699118      3      1  25.4667  None        S
115       3  male  21.000000      0      0   7.9250  None        S
_____________
(712, 8)
(712, 460)
[[0. 0. 1. ... 0. 0. 1.]
[0. 0. 1. ... 0. 0. 1.]
[1. 0. 0. ... 0. 0. 1.]
...
[0. 0. 1. ... 0. 0. 1.]
[1. 0. 0. ... 0. 0. 1.]
[0. 0. 1. ... 0. 0. 1.]]
So I can see that more features have been added (as expected), but which categories are actually being encoded? All of them? If so, Age has a finite number of "categories" but is clearly not a categorical variable. Is this not a problem? If the encoder uses the pandas DataFrame column dtype to decide whether to one-hot encode a column, then what happens to "Pclass", which has type int but is clearly categorical?
One-hot encoding should be used for categorical features only. If you have 3 categorical features that can take 1, 2 and 3 different values respectively, you will get 1 + 2 + 3 = 6 new features. For example, if your feature is fruit and can take the values apple, pineapple and pear, then after one-hot encoding you have 3 new features (apple, pineapple and pear), each of which takes the value 1 or 0.
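The column count described above can be checked with a small sketch. The DataFrame `toy` and its column names are made up for illustration and are not part of the Titanic data; the encoder is constructed the same way as in the question.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: three categorical columns with 1, 2 and 3 distinct values.
toy = pd.DataFrame({
    "a": ["x", "x", "x"],
    "b": ["p", "q", "p"],
    "c": ["apple", "pineapple", "pear"],
})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(toy)

# One output column per distinct value in each input column: 1 + 2 + 3 = 6.
print(encoded.shape)
# The categories the encoder learned, one array per input column.
print(enc.categories_)
```

The `categories_` attribute is the place to look when you want to know exactly which values were turned into columns.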
Age is not a categorical feature, so you should not one-hot encode it.
You have to decide which of your variables are categorical, and then apply the OneHotEncoder to those features only.
If you don't restrict it to the categorical features, OneHotEncoder will encode every feature it is given (categorical or not), treating each unique value in a column as a category; it does not look at the column dtype.
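A short sketch of that behaviour, using a made-up two-column frame (the values are illustrative, not taken from the real dataset): the float column Age is encoded just like the string column Sex, with one output column per distinct age.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one string column and one float column.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [42.0, 26.0, 80.0, 21.0],
})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(toy)

# Every column is treated as categorical, regardless of dtype:
# Sex contributes 2 columns, Age contributes one column per unique value.
for name, cats in zip(toy.columns, enc.categories_):
    print(name, len(cats), list(cats))

print(encoded.shape)
```

This is presumably why the question's output grew from 8 to 460 columns: continuous columns such as Age and Fare contribute one column per distinct value.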
So I strongly recommend that you first define the categorical features and apply the OneHotEncoder to them only.
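One possible way to do this, as a minimal sketch rather than the only approach, is sklearn's ColumnTransformer, which applies the encoder to a chosen list of columns and passes the rest through unchanged. The small `x_train` below is made-up stand-in data, and the choice of `categorical_cols` is an assumption about which Titanic columns you consider categorical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the real training frame.
x_train = pd.DataFrame({
    "Pclass": [3, 3, 1, 3],
    "Sex": ["male", "male", "male", "female"],
    "Age": [42.0, 26.0, 80.0, 21.0],
    "Fare": [7.55, 7.775, 30.0, 7.925],
    "Embarked": ["S", "S", "S", "C"],
})

# Assumption: these are the columns to treat as categorical.
# Pclass is listed explicitly even though its dtype is int.
categorical_cols = ["Pclass", "Sex", "Embarked"]

ct = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",  # keep Age and Fare as plain numeric columns
)

x_encoded = ct.fit_transform(x_train)

# 2 (Pclass) + 2 (Sex) + 2 (Embarked) one-hot columns, plus Age and Fare.
print(x_encoded.shape)
```

Listing the columns by name keeps the decision explicit, so an integer-coded column like Pclass is encoded while Age and Fare are left alone.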