
When I one-hot encode a categorical variable using OneHotEncoder, do I need to remove the original column before I train a machine learning model?

I used OneHotEncoder to convert a zipcode column before feeding it into a Random Forest model:

from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder()
# Fit on the single zipcode column; returns a sparse matrix.
encoded = one_hot.fit_transform(df[['zipcode']])
# Add one new column per zipcode category to the DataFrame.
df[one_hot.categories_[0]] = encoded.toarray()

Should I drop the original "zipcode" column from the independent variables? Or does sklearn account for that?

I ask mainly because "zipcode" is showing up as the second most important feature. Is that an aggregate of the importances of all the one-hot encoded features?

Short answer: Yes, you need to exclude it.

sklearn has no way of knowing which features matter, which don't, or whether some of them are connected (which is also why you should avoid using heavily correlated features). OneHotEncoder merely adds new columns that encode your categorical variable (the zipcode); it does not remove or modify the original one. If you leave the zipcode column in, the model treats it as one more feature, and a numerical one at that: a higher zipcode suddenly carries ordinal meaning.
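For illustration, a minimal sketch of dropping the raw column before fitting, assuming a regression task where 'price' is a hypothetical target column and df is the DataFrame from your snippet:

from sklearn.ensemble import RandomForestRegressor

# Hypothetical target column; substitute your own.
y = df['price']

# Keep the one-hot columns, drop both the target and the raw zipcode
# so the model never sees the zipcode as a plain number.
X = df.drop(columns=['price', 'zipcode'])

model = RandomForestRegressor(random_state=0)
model.fit(X, y)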

The deeper question would then be: since the numerical value of your postal codes seems to have a significant influence on your prediction, what could that mean? Do the codes increase in a geographical pattern that is also reflected in your dependent variable? Do higher/lower codes correspond to smaller/bigger cities and thus explain some of what you see? Things like this could be interesting (or trivial) and could be taken into account in your analysis on another level. ;D
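As for your second question: the importance reported for a raw zipcode column is its own, not an aggregate; each one-hot column gets a separate entry in feature_importances_. If you want the aggregate, you can sum the one-hot importances yourself. A sketch, assuming the fitted model and X from the snippet above:

import pandas as pd

# Map each column's importance back by name.
importances = pd.Series(model.feature_importances_, index=X.columns)

# The one-hot column names are exactly the learned categories.
zip_columns = one_hot.categories_[0]
print('aggregate zipcode importance:', importances[zip_columns].sum())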
