I found this thread from 2014, and the answer states that no, sklearn's random forest classifier cannot handle categorical variables (at least not directly). Has the answer changed as of 2020?
I want to feed gender as a feature for my model. However, gender can take on three values: M, F, or np.nan. If I encode this column into three dichotomous columns, how can the random forest classifier know that these three columns represent a single feature?
Imagine max_features = 7. When training a given tree, it will randomly pick seven features. Suppose gender was chosen. If gender is split into three columns (gender_M, gender_F, gender_NA), will the random forest classifier always pick all three columns and count them as one feature, or is there a chance that it will only pick one or two?
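For concreteness, the encoding described above could be produced like this (a sketch using pandas; the column names are illustrative):

```python
import pandas as pd

# Toy data with the three possible values: 'M', 'F', and missing
df = pd.DataFrame({"gender": ["M", "F", None, "M"]})

# dummy_na=True creates an extra column for missing values,
# giving the three dichotomous columns described above
dummies = pd.get_dummies(df["gender"], prefix="gender", dummy_na=True)
print(dummies.columns.tolist())  # ['gender_F', 'gender_M', 'gender_nan']
```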
If max_features is set to a value lower than the actual number of columns (which is the advisable approach; see the recommended values for max_features in the docs), then yes, there is a chance that, for a given estimator in the random forest, only a subset of the dummy columns is considered.
But that is not necessarily a bad thing. In a decision tree, a feature is selected as the node at a given level by optimizing some metric independently of the other features, that is, considering only that feature and the target. So, in a sense, the model does not treat these dummy columns as belonging to the same feature anyway.
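You can see this directly by inspecting which columns an individual tree actually splits on. Below is a minimal sketch on synthetic data (the feature layout and names are assumptions for illustration): with max_features smaller than the number of columns, each split draws a random feature subset, so a given tree may use only some of the three gender dummies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 200
gender = rng.choice([0, 1, 2], size=n)  # 0=M, 1=F, 2=NA (toy labels)

# Columns 0-2 are the one-hot gender dummies; columns 3-4 are
# unrelated numeric features
X = np.column_stack([
    gender == 0, gender == 1, gender == 2,
    rng.randn(n), rng.randn(n),
]).astype(float)
y = (gender == 0).astype(int)

# Each split considers only max_features randomly drawn columns
clf = RandomForestClassifier(n_estimators=10, max_features=2, random_state=0)
clf.fit(X, y)

# Columns actually used for splits by the first tree (-2 marks leaves)
used = {f for f in clf.estimators_[0].tree_.feature if f >= 0}
print(used)
```

Each dummy column competes for splits on its own, which is exactly the "treated independently" behavior described above.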
In general, though, the best approach for binary features is to come up with an appropriate method to fill missing values, and then convert the column into a single column encoded as 0s and 1s.
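A minimal sketch of that single-column approach, assuming a most-frequent-value fill (choose whatever imputation strategy fits your data):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", None, "M", "F", "M"]})

# Fill missing values first (here with the most frequent value),
# then map the two categories to a single 0/1 column
filled = df["gender"].fillna(df["gender"].mode()[0])
df["gender_encoded"] = filled.map({"M": 0, "F": 1})
print(df["gender_encoded"].tolist())  # [0, 1, 0, 0, 1, 0]
```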