
Categorical Feature Encoding as Enum for Scikit-Learn

I am currently trying to preprocess a very large dataset with many categorical features for scikit-learn's RandomForest model (regression). The nature of the categorical data requires that the encoding scheme not introduce any ordinality. The H2O ML framework (Link) offers enum encoding, which would suit my data perfectly. However, I rely on scikit-learn's RandomForest.

Is anyone aware of an enum encoding for scikit-learn models? (One-hot encoding is not an option.)

Thanks in advance!

sklearn only offers label encoding (LabelEncoder) and one-hot encoding (OHE). Label encoding does not provide the functionality you want: categories are simply encoded as integers, which I believe is meaningful for ordinal categories only. As far as I know, sklearn leaves it up to the individual models to implement such enum category treatment (there are many models in sklearn, and most of them would not be able to benefit from such an encoding).
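To illustrate why integer codes carry an implicit order, here is a minimal pure-Python sketch (no sklearn involved). The helper names `label_encode`, `threshold_partitions` and `subset_partitions` are made up for this illustration. It compares the groups a single `code <= t` split can form on label-encoded values with the groups an enum-style `category in S` split could form.

```python
from itertools import combinations


def label_encode(values):
    """Map each distinct category to an integer in sorted order (label-encoder style)."""
    categories = sorted(set(values))
    mapping = {cat: idx for idx, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def threshold_partitions(mapping):
    """Left-hand groups reachable by ONE `code <= t` split on the integer codes."""
    ordered = sorted(mapping, key=mapping.get)
    return [frozenset(ordered[:i]) for i in range(1, len(ordered))]


def subset_partitions(mapping):
    """Left-hand groups reachable by ONE enum-style `category in S` split."""
    cats = sorted(mapping)
    groups = set()
    for size in range(1, len(cats)):
        for combo in combinations(cats, size):
            groups.add(frozenset(combo))
    return groups


colors = ["red", "green", "blue", "green", "yellow"]
codes, mapping = label_encode(colors)
print(codes)                                # integer codes, ordered alphabetically
print(len(threshold_partitions(mapping)))   # only contiguous code ranges
print(len(subset_partitions(mapping)))      # any non-trivial subset of categories
```

With four categories, a single threshold split can only separate contiguous runs of codes, so a group such as {blue, red} is not reachable in one split. A deeper tree can still isolate it with several splits, but at the cost of extra depth, and the arbitrary alphabetical order influences which groupings are cheap.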

I think LightGBM claims here that it implements this type of category treatment internally, but I'm not 100% sure that is true. The advantage is that it has both RF and GBM tree builders, so you can easily switch between them, and it is faster than the sklearn implementation.
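For intuition about what internal category treatment can look like: as far as I understand LightGBM's documentation, it sorts the categories of a feature by accumulated gradient statistics and then searches for the best split along that sorted order. The sketch below is a simplified pure-Python analogue for squared-error regression, sorting by mean target instead. It is not LightGBM's code, and the function name `best_category_split` is hypothetical.

```python
from collections import defaultdict


def best_category_split(categories, targets):
    """Find a `category in S` split by scanning categories sorted by mean target."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    # Order categories by their mean target, not by name or integer code.
    ordered = sorted(sums, key=lambda c: sums[c] / counts[c])

    total_sum = sum(sums.values())
    total_n = len(targets)
    best_score, best_left = float("-inf"), None
    left_sum, left_n = 0.0, 0

    # Try every prefix of the sorted order as the left-hand group.
    for i, cat in enumerate(ordered[:-1]):
        left_sum += sums[cat]
        left_n += counts[cat]
        right_sum = total_sum - left_sum
        right_n = total_n - left_n
        # Maximizing this is equivalent to minimizing within-group squared error.
        score = left_sum ** 2 / left_n + right_sum ** 2 / right_n
        if score > best_score:
            best_score, best_left = score, set(ordered[: i + 1])
    return best_left


cats = ["a", "b", "c", "d", "a", "b", "c", "d"]
ys = [10, 1, 9, 2, 11, 0, 10, 1]
left_group = best_category_split(cats, ys)
print(left_group)
```

In this toy data the chosen left group is {b, d}, which is not a contiguous range under alphabetical integer codes (a=0, b=1, c=2, d=3), so a single threshold split on label-encoded values could not produce it.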

Note also that CatBoost has a rich toolkit for internal category encoding, but I have zero experience with it so far.
