I found this thread from 2014, and the answer states that no, sklearn's random forest classifier cannot handle categorical variables (at least not directly). Has the answer changed as of 2020?
I want to feed gender as a feature for my model. However, gender can take on three values: M, F, or np.nan. If I encode this column into three dichotomous columns, how can the random forest classifier know that these three columns represent a single feature?
Imagine max_features = 7. When training a given tree, it will randomly pick seven features. Suppose gender was chosen. If gender is split into three columns (gender_M, gender_F, gender_NA), will the random forest classifier always pick all three columns and count them as one feature, or is there a chance that it will only pick one or two?
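For concreteness, the encoding described above could be produced like this (a sketch using pandas; the column names are illustrative):

```python
import pandas as pd

# Toy data with the three possible values: 'M', 'F', and missing
df = pd.DataFrame({"gender": ["M", "F", None, "M"]})

# dummy_na=True creates an extra column for missing values,
# giving the three dichotomous columns described above
dummies = pd.get_dummies(df["gender"], prefix="gender", dummy_na=True)
print(dummies.columns.tolist())  # ['gender_F', 'gender_M', 'gender_nan']
```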
If max_features is set to a value lower than the actual number of columns (which is the advisable approach; see the recommended values for max_features in the docs), then yes, there is a chance that, for a given estimator in the random forest, only a subset of the dummy columns is considered.
But that is not necessarily a bad thing. In a decision tree, a feature is selected as the node at a given level by optimizing some metric independently of the other features, that is, considering only that feature and the target. So, in a sense, the model does not treat these dummy columns as belonging to the same feature anyway.
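You can see this directly by inspecting which columns an individual tree actually splits on. Below is a minimal sketch on synthetic data (the feature layout and names are assumptions for illustration): with max_features smaller than the number of columns, each split draws a random feature subset, so a given tree may use only some of the three gender dummies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 200
gender = rng.choice([0, 1, 2], size=n)  # 0=M, 1=F, 2=NA (toy labels)

# Columns 0-2 are the one-hot gender dummies; columns 3-4 are
# unrelated numeric features
X = np.column_stack([
    gender == 0, gender == 1, gender == 2,
    rng.randn(n), rng.randn(n),
]).astype(float)
y = (gender == 0).astype(int)

# Each split considers only max_features randomly drawn columns
clf = RandomForestClassifier(n_estimators=10, max_features=2, random_state=0)
clf.fit(X, y)

# Columns actually used for splits by the first tree (-2 marks leaves)
used = {f for f in clf.estimators_[0].tree_.feature if f >= 0}
print(used)
```

Each dummy column competes for splits on its own, which is exactly the "treated independently" behavior described above.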
In general, though, the best approach for binary features is to come up with an appropriate method to fill missing values, and then convert the column into a single column encoded as 0s and 1s.
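A minimal sketch of that single-column approach, assuming a most-frequent-value fill (choose whatever imputation strategy fits your data):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", None, "M", "F", "M"]})

# Fill missing values first (here with the most frequent value),
# then map the two categories to a single 0/1 column
filled = df["gender"].fillna(df["gender"].mode()[0])
df["gender_encoded"] = filled.map({"M": 0, "F": 1})
print(df["gender_encoded"].tolist())  # [0, 1, 0, 0, 1, 0]
```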