
Poor Accuracy Prediction with random forest in R

I'm a newbie in R. I'm trying to predict a Customer Type (member or normal customer in the store) from different variables (Gender, total spent, rating, ...) using 1000 customers' information in my dataframe. I created an algorithm with a random forest, but the accuracy is around 49% (OOB error rate). I tried to use importance(RFM) to get higher accuracy by excluding irrelevant variables, but I ended up with around 51% accuracy... Does this mean there is no connection between the features and the outcome, or is there a way to tune the model to get higher accuracy? Thank you so much.

#Load the randomForest package
library(randomForest)

#Creating a vector that assigns each row to training or testing (70% / 30% split)
index = sample(2, nrow(df), replace = TRUE, prob = c(0.7, 0.3))

#Training data
training = df[index==1,]

#Testing data
testing = df[index==2,]

#Random forest model 
RFM = randomForest(as.factor(Customer_type)~., data = training, ntree = 500, do.trace=T)
importance(RFM)

# Evaluating Model Accuracy
customertype_pred = predict(RFM, testing)
testing$customertype_pred = customertype_pred
View(testing)

#Building confusion Matrix to compare
CFM = table(testing$Customer_type, testing$customertype_pred)
CFM
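For reference, the overall test-set accuracy can be read off this confusion matrix (a small sketch, assuming the CFM table built above):

```r
#overall accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(CFM)) / sum(CFM)
accuracy
```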

Without your data or a reproducible example, it is hard to really improve your model. I can suggest some procedures and packages that can help you a lot in this kind of task. Have a look at the caret package, which is designed precisely for model tuning. The package is really well documented, with lots of useful examples. Here is the general workflow for working with caret:

#load library and the data for this example
library(caret)
#this is a caret built-in dataset
data(GermanCredit)
df <- GermanCredit[,1:10]
str(GermanCredit)
#caret offers a useful function for data splitting. Here we split the data according to
#the Class column (our outcome to be predicted), into 80% training and 20% testing data
ind <- createDataPartition(df$Class,p=0.8,list = F)
training <- df[ind,]
test <- df[-ind,]

#here we set the resampling method for hyperparameters tuning
#in this case we choose 10-fold cross validation
cn <- trainControl(method = "cv",number = 10)
#the grid of hyperparameters with which to tune the model
grid <- expand.grid(mtry=2:(ncol(training)-1))

#here is the proper model fitting. We fit a random forests model (method="rf") using 
#Class as outcome and all other variables as predictors, using the selected resampling 
#method and tuning grid
fit <- train(
  Class ~ .,
  data = training,
  method = "rf",
  trControl = cn,
  tuneGrid = grid
)

The output of the model looks like this:

Random Forest 

800 samples
9 predictor
2 classes: 'Bad', 'Good' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 
Resampling results across tuning parameters:

mtry  Accuracy  Kappa    
2     0.71125   0.1511164
3     0.70875   0.1937589
4     0.70000   0.1790469
5     0.70000   0.1819945
6     0.70375   0.1942889
7     0.70250   0.1955456
8     0.70625   0.2025015
9     0.69750   0.1887295

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

As you can see, the function train built a randomForest model for each value of the tuning parameter (in this case only mtry) and chose the best parameter setting according to the model with the maximum accuracy. The final parameter setting is then used to build the final model on all the data supplied to train (in this case, all observations of the training data.frame). The output gives the resampling performance, which is usually optimistic. To test the accuracy of the model against the test set we can do:

#predict the output on the test set.
p <- predict(fit,test[,-10])
#this function builds a confusion matrix and calculates a lot of accuracy statistics
confusionMatrix(p,test$Class)
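If you want to inspect the tuning results themselves, the object returned by train exposes them directly (a short sketch, assuming the fit object from above):

```r
#resampled accuracy and Kappa for every value of mtry that was tried
fit$results
#the winning hyperparameter setting
fit$bestTune
#plot resampled accuracy against mtry
plot(fit)
```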

You can pass arguments of the chosen model (in this case randomForest) to the function via the ... argument of train. Like this:

fit <- train(
  Class ~ .,
  data = training,
  method = "rf",
  trControl = cn,
  tuneGrid = grid,
  ntree = 200 #grow 200 trees
)

To find the best set of variables (also known as variable selection or feature selection), caret has a lot of functions that can be helpful. There is an entire section of the package vignette on variable selection, covering simple filters, backwards selection, recursive feature elimination, genetic algorithms, simulated annealing, and of course the built-in feature selection methods of many models (like variable importance for randomForest). However, feature selection is a huge topic; I suggest you start with the methods in the caret package and dig deeper if you don't find what you are looking for.
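As a starting point, recursive feature elimination can be sketched with caret's rfe function (this reuses the GermanCredit data from above; the sizes values are just illustrative subset sizes to try):

```r
library(caret)
data(GermanCredit)
df <- GermanCredit[, 1:10]

#recursive feature elimination using random forest importance (rfFuncs),
#evaluated with 10-fold cross validation
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rfe_fit <- rfe(df[, -10], df$Class, sizes = c(2, 4, 6, 8), rfeControl = ctrl)

rfe_fit
#names of the variables retained in the best subset
predictors(rfe_fit)
```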
