
What should be the type of a categorical variable when using the function randomForest?

This is a general theory question. I was asked it in a college mock interview for data science, and I searched for an answer but could not find one elsewhere. I hope someone can help me with this. I also don't have much hands-on experience with random forests.

In terms of general theory, random forests can work with both numeric and categorical data. The R function randomForest (see the package documentation) supports categorical data coded as factors, so factor would be your type.

Many machine learning algorithms require features to be encoded in numerical form. You can either one-hot encode each level of a feature (a 0/1 column per level indicating its presence), or label encode it so that each level within the feature gets a numerical value (1, 2, 3). One-hot encoding is typically preferred, because label encoding can appear to impose an order on the levels that does not actually exist. The risk with one-hot encoding is that a feature with too many levels expands the feature space too much, producing a high-dimensional feature set that can be a challenge if not enough data is available. Hence, some approaches encode only the most common levels of a feature.

Sources: AceAI Interview Prep, Kaggle, An Introduction to Statistical Learning with Applications in R
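
To make the two encodings from this answer concrete, here is a minimal sketch in plain Python. It is only an illustration: the helper names `label_encode` and `one_hot_encode`, the `top_k` parameter, and the `colors` data are all made up for this example and are not part of randomForest or any other library. As the first answer notes, R's randomForest takes factors directly, so this kind of manual encoding matters mainly for tools that accept only numeric input.

```python
from collections import Counter


def label_encode(values):
    """Map each distinct level to an integer code (1, 2, 3, ...) in sorted order."""
    levels = sorted(set(values))
    codes = {level: i for i, level in enumerate(levels, start=1)}
    return [codes[v] for v in values], codes


def one_hot_encode(values, top_k=None):
    """Build one 0/1 indicator column per level.

    If top_k is given (a hypothetical knob for this sketch), only the
    top_k most frequent levels get a column; rarer levels become all-zero rows.
    """
    counts = Counter(values)
    if top_k is None:
        levels = sorted(counts)
    else:
        levels = [level for level, _ in counts.most_common(top_k)]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return rows, levels


# Hypothetical categorical feature with four levels.
colors = ["red", "green", "blue", "green", "red", "green", "teal"]

label_codes, mapping = label_encode(colors)
one_hot_rows, one_hot_levels = one_hot_encode(colors)
top2_rows, top2_levels = one_hot_encode(colors, top_k=2)

print(label_codes)     # integer codes suggest an order (blue < green < red < teal)
print(one_hot_levels)  # one column per level, no implied order
print(top2_levels)     # only the two most common levels get a column
```

Note how the label codes make "teal" look larger than "blue" even though the colors have no natural order, while the one-hot version needs one column per level and therefore grows with the number of levels. Restricting to the most common levels keeps the column count fixed at the cost of lumping rare levels together.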

