
What should be the type of categorical variable when using the function randomForest?

This is a general theory question. I was asked it in a college mock interview for data science, and I searched for the answer but could not find it elsewhere. I hope someone can help me with this. I also don't have much hands-on experience with randomForest.

In terms of general theory, random forests can work with both numeric and categorical data. The function randomForest (see its documentation) supports categorical data coded as factors, so factor would be your type.

Many machine learning algorithms require features to be encoded in numerical form. You can either one-hot encode (one 0/1 column for each level of a feature, indicating its presence), or you can label encode, so that each level within a feature gets a numerical value (1, 2, 3). Typically one-hot encoding is used, because label encoding may appear to give an order to the levels of the feature. A risk with one-hot encoding is that if a feature has too many levels, the feature space expands into a high-dimensional feature set, which can be a challenge if not enough data is present. Hence, some approaches only encode the most common levels of a feature.
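To make the two encodings concrete, here is a minimal sketch in plain Python (chosen only for illustration; it is not how randomForest treats factors in R). The helper names `label_encode` and `one_hot_encode`, the `max_levels` parameter, and the sample `colors` data are all hypothetical, made up for this example.

```python
from collections import Counter


def label_encode(values):
    """Map each distinct level to an integer code (1, 2, 3, ...).

    Levels are numbered in sorted order, which is an arbitrary choice:
    the resulting codes look ordered even though the levels are not.
    """
    levels = sorted(set(values))
    codes = {level: i + 1 for i, level in enumerate(levels)}
    return [codes[v] for v in values], levels


def one_hot_encode(values, max_levels=None):
    """Build one 0/1 column per level.

    If max_levels is given, only the most frequent levels get a column;
    rows with a rarer level become all zeros (hypothetical design choice).
    """
    counts = Counter(values)
    kept = [level for level, _ in counts.most_common(max_levels)]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return rows, kept


# Hypothetical sample data: one categorical feature with three levels.
colors = ["red", "green", "blue", "green", "red", "green"]

label_codes, label_levels = label_encode(colors)
onehot_rows, onehot_levels = one_hot_encode(colors)
top_rows, top_levels = one_hot_encode(colors, max_levels=2)

print(label_levels, label_codes)
print(onehot_levels, onehot_rows)
print(top_levels, top_rows)
```

With `max_levels=2`, the rarest level ("blue") gets no column of its own, which is one simple way to keep the feature space from growing with every rare level.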

Sources: AceAI Interview Prep; Kaggle; An Introduction to Statistical Learning with Applications in R
