What should be the type of a categorical variable when using the function randomForest?
This is a general theory question. I was asked it in a college mock interview for data science, and I searched for the answer but could not find it elsewhere. I hope someone can help me with this. I also don't have much hands-on experience with random forests.
In terms of general theory, random forests can work with both numeric and categorical data. The R function randomForest (see its documentation) supports categorical data coded as factors, so factor would be your type.
Machine learning algorithms generally require features to be encoded in numerical form. You can either one-hot encode a feature, creating a 0/1 indicator for each level to mark its presence, or label encode it, so that each level of the feature is assigned a numerical value (1, 2, 3). One-hot encoding is typically preferred because label encoding may appear to impose an order on the levels. A risk with one-hot encoding is that if a feature has too many levels, the feature space expands into a high-dimensional feature set, which can be a challenge when not enough data is available. Hence, some approaches one-hot encode only the most common levels of a feature.
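The two encodings and the most-common-levels idea described above can be sketched in plain Python. This is only an illustration, not code from the cited sources and not how randomForest handles factors internally. The helper names (label_encode, one_hot_encode, top_k_levels) and the colors feature are made up for the example.

```python
from collections import Counter


def label_encode(values):
    """Map each distinct level to an integer code (1, 2, 3, ...) in sorted order."""
    levels = sorted(set(values))
    mapping = {level: code for code, level in enumerate(levels, start=1)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values, levels=None):
    """Build one 0/1 indicator column per level (all observed levels by default)."""
    if levels is None:
        levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


def top_k_levels(values, k):
    """Return the k most frequent levels of a feature."""
    return [level for level, _ in Counter(values).most_common(k)]


# Hypothetical categorical feature with three levels.
colors = ["red", "green", "blue", "green", "red", "green"]

# Label encoding: one integer column, which implies an order blue < green < red.
codes, mapping = label_encode(colors)

# One-hot encoding: one indicator column per level, no implied order.
one_hot = one_hot_encode(colors)

# Encode only the two most common levels to limit the number of columns;
# rows holding a rarer level ("blue" here) are all zeros.
frequent = top_k_levels(colors, k=2)
one_hot_top = one_hot_encode(colors, levels=frequent)
```

With label encoding the feature stays a single column, while one-hot encoding adds one column per level, which is why restricting it to the most frequent levels keeps the feature space smaller.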
Sources: AceAI Interview Prep, Kaggle, An Introduction to Statistical Learning with Applications in R