randomForest categorical predictor limit

Question

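I understand and appreciate that R's randomForest function can only handle categorical predictors with fewer than 54 categories. However, when I trim my categorical predictor down to fewer than 54 categories, I still get the error. The only questions I have seen on Stack Overflow about this limit ask how to get around it; I am instead trying to reduce the number of categories to comply with the function's limit, and the error still appears.

The script below creates a data frame so that we can predict "profession". Understandably, running randomForest() on "df" fails with the error "Can not handle categorical predictors with more than 53 categories." because of the variable "college_id".

But when I trim the data set to include only the top 40 college IDs, I get the same error. Am I missing some basic data frame concept that keeps all the categories, even though only 40 of them now appear in the "df2" data frame? What workaround can I use?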

library(dplyr)
library(randomForest)

# create data frame
# stringsAsFactors = TRUE matches the pre-4.0 R default, so 'profession' is a factor (classification)
df <- data.frame(profession = sample(c("accountant", "lawyer", "dentist"), 10000, replace = TRUE),
                 zip = sample(c("32801", "32807", "32827", "32828"), 10000, replace = TRUE),
                 salary = sample(c(50000:150000), 10000, replace = TRUE),
                 college_id = as.factor(c(sample(c(1001:1040), 9200, replace = TRUE),
                                          sample(c(1050:9999), 800, replace = TRUE))),
                 stringsAsFactors = TRUE)


# results in error, as expected
rfm <- randomForest(profession ~ ., data = df)


# arrange college_ids by count and retain the top 40 in the 'df2' data frame
sdf <- df %>% 
  dplyr::group_by(college_id) %>% 
  dplyr::summarise(n = n()) %>% 
  dplyr::arrange(desc(n))
sdf <- sdf[1:40, ]
df2 <- dplyr::inner_join(df, sdf, by = "college_id")
df2$n <- NULL


# confirm that df2 only contains 40 categories of 'college_id'
nrow(df2[which(!duplicated(df2$college_id)), ])


# THIS IS WHAT I WANT TO RUN, BUT STILL RESULTS IN ERROR
rfm2 <- randomForest(profession ~ ., data = df2)

Answer 1

I think the variable still carries all of its original factor levels. Try adding the following line before fitting the forest again:

df2$college_id <- factor(df2$college_id)
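To see why that line helps, here is a minimal base-R sketch of the behaviour. It assumes a toy factor `ids` with 60 made-up levels standing in for `college_id`; the names `ids`, `kept`, `refactored`, and `dropped` are illustrative only and do not come from the question.

```r
# Toy factor with 60 levels, standing in for college_id (values are made up)
ids <- factor(1001:1060)

# Subsetting shrinks the data, but the factor keeps its full set of levels
kept <- ids[1:40]
nlevels(kept)          # 60: the unused levels are still attached
length(unique(kept))   # 40: distinct values actually present

# Re-encoding with factor() (as in the answer) or droplevels() removes unused levels
refactored <- factor(kept)
dropped <- droplevels(kept)
nlevels(refactored)    # 40
nlevels(dropped)       # 40
```

This also suggests why the `duplicated()` check in the question reports 40: it counts the values present in the rows, whereas the limit appears to apply to the number of levels stored on the factor, which `nlevels(df2$college_id)` would report. Calling `droplevels(df2)` should have the same effect for every factor column in the data frame at once.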

Answer 1 was accepted (2016-06-30 13:13:54).
