
Can sklearn random forest classifier handle categorical variables?

I found this thread from 2014 and the answer states that no, sklearn random forest classifier cannot handle categorical variables (or at least not directly). Has the answer changed in 2020?

I want to feed gender as a feature to my model. However, gender can take on three values: `M`, `F`, or `np.nan`. If I encode this column into three dichotomous columns, how can the random forest classifier know that these three columns represent a single feature?

Imagine `max_features = 7`. When training a given tree, it will randomly pick seven features. Suppose gender was chosen. If gender is split into three columns (`gender_M`, `gender_F`, `gender_NA`), will the random forest classifier always pick all three columns and count them as one feature, or is there a chance that it will only pick one or two?
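For context, here is a minimal sketch of the encoding the question describes, using hypothetical sample data. `pd.get_dummies` with `dummy_na=True` produces one column per category, including a column for missing values:

```python
import pandas as pd

# Hypothetical sample data: a gender column with a missing value
df = pd.DataFrame({"gender": ["M", "F", None, "M"]})

# One-hot encode; dummy_na=True adds a separate column for missing values,
# yielding three independent columns: gender_F, gender_M, gender_nan
dummies = pd.get_dummies(df["gender"], prefix="gender", dummy_na=True)
print(list(dummies.columns))
```

From the model's point of view these are simply three unrelated binary columns; nothing in the data tells it they came from one original feature.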

If `max_features` is set to a value lower than the actual number of columns (which is the advisable approach; see the recommended values for `max_features` in the docs), then yes, there is a chance that for a given estimator in the random forest only a subset of the dummy columns is considered.

But that is not necessarily too bad. In decision trees, a feature is selected as a node at a given level by optimizing some metric independently of the other features, that is, considering only that feature and the target. So in a sense the model will never treat these dummy columns as belonging to the same feature anyway.
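A small sketch illustrating this point, using synthetic data (the 7-column setup from the question is an assumption for illustration): sklearn assigns each dummy column its own importance score and subsamples columns individually at each split, with no grouping of related columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical data: 3 dummy columns for gender plus 4 other binary
# features, 7 columns total
X = rng.integers(0, 2, size=(100, 7)).astype(float)
y = rng.integers(0, 2, size=100)

# With max_features=2, each split considers only 2 of the 7 columns, so a
# tree may well evaluate gender_M without ever seeing gender_F or gender_NA
clf = RandomForestClassifier(n_estimators=10, max_features=2, random_state=0)
clf.fit(X, y)

# Each column gets its own importance score: the model has no notion that
# columns 0-2 belong to one underlying feature
print(len(clf.feature_importances_))
```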

In general though, the best approach for a binary feature is to come up with an appropriate method for filling the missing values, and then convert it into a single column encoded as 0s and 1s.
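A minimal sketch of that approach, assuming mode imputation as the fill strategy (the right strategy depends on the data):

```python
import pandas as pd

# Hypothetical sample data with missing values
df = pd.DataFrame({"gender": ["M", "F", None, "M", None]})

# Fill missing values with the most frequent category (one simple choice)
mode = df["gender"].mode()[0]
filled = df["gender"].fillna(mode)

# Encode as a single 0/1 column instead of three dummy columns
df["gender_M"] = (filled == "M").astype(int)
print(df["gender_M"].tolist())
```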

