
Poor Accuracy Prediction with random forest in R

I'm a newbie in R. I'm trying to predict a customer type (member or normal customer of the store) from different variables (gender, total spent, rating, ...) using information on 1,000 customers in my data frame. I built a random forest model, but the accuracy is only around 49% (OOB error rate). I tried to use importance(RFM) to get higher accuracy by dropping the less relevant variables, but I still end up with only about 51% accuracy. Does that mean there is no connection between the features and the customer type, or is there a way to tune the model to get higher accuracy? Thank you so much.

#load the randomForest package
library(randomForest)

#Create an index that randomly splits the data into training (70%) and testing (30%) sets
index = sample(2, nrow(df), replace = TRUE, prob = c(0.7, 0.3))

#Training data
training = df[index==1,]

#Testing data
testing = df[index==2,]

#Random forest model
RFM = randomForest(as.factor(Customer_type) ~ ., data = training, ntree = 500, do.trace = TRUE)
#variable importance
importance(RFM)

# Evaluating Model Accuracy
customertype_pred = predict(RFM, testing)
testing$customertype_pred = customertype_pred
View(testing)

#Building the confusion matrix to compare actual and predicted customer types
CFM = table(testing$Customer_type, testing$customertype_pred)
CFM
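The overall test-set accuracy can then be read from this table, for example:

#overall accuracy: correct predictions divided by total predictions
sum(diag(CFM)) / sum(CFM)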

Without your data or a reproducible example, it is hard to really improve your model. I can suggest some procedures and packages that can help you a lot with this kind of task. Have a look at the caret package, which is designed precisely for model tuning. The package is really well documented, with lots of useful examples. Here is the general workflow for working with caret:

#load library and the data for this example
library(caret)
#GermanCredit is a caret built-in dataset; keep the first 9 predictors plus the Class outcome
data(GermanCredit)
df <- GermanCredit[, 1:10]
str(df)
#caret offers useful functions for data splitting. Here we split the data according to
#the Class column (the outcome to be predicted), into 80% training and 20% testing data
ind <- createDataPartition(df$Class,p=0.8,list = F)
training <- df[ind,]
test <- df[-ind,]

#here we set the resampling method for hyperparameter tuning;
#in this case we choose 10-fold cross-validation
cn <- trainControl(method = "cv", number = 10)
#the grid of hyperparameter values over which to tune the model
grid <- expand.grid(mtry = 2:(ncol(training) - 1))

#here is the actual model fitting. We fit a random forest model (method = "rf") using
#Class as the outcome and all other variables as predictors, with the selected resampling
#method and tuning grid
fit <- train(
  Class ~ .,
  data = training,
  method = "rf",
  trControl = cn,
  tuneGrid = grid
)

The output of the model looks like this:

Random Forest 

800 samples
9 predictor
2 classes: 'Bad', 'Good' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 
Resampling results across tuning parameters:

mtry  Accuracy  Kappa    
2     0.71125   0.1511164
3     0.70875   0.1937589
4     0.70000   0.1790469
5     0.70000   0.1819945
6     0.70375   0.1942889
7     0.70250   0.1955456
8     0.70625   0.2025015
9     0.69750   0.1887295

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

As you can see, the train function built a random forest model for each value of the tuning parameter (in this case only mtry) and chose the best parameter setting as the one with the maximum accuracy. The final parameter setting is then used to build the final model on all the data supplied to train (in this case all observations of the training data frame). The output gives the resampling performance, which is usually optimistic. To test the accuracy of the model against the test set we can do:

#predict the outcome on the test set (column 10 is the Class column we are predicting)
p <- predict(fit, test[, -10])
#this function builds a confusion matrix and calculates a lot of accuracy statistics
confusionMatrix(p, test$Class)
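If you want to inspect the tuning results yourself, the train object keeps them; a small sketch using only the fit object created above:

#best value of mtry found during tuning
fit$bestTune
#resampling performance for every value of mtry
fit$results
#plot accuracy as a function of mtry
plot(fit)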

You can pass additional arguments of the chosen model (in this case randomForest) to train through its ... argument, like this:

fit <- train(
  Class ~ .,
  data = training,
  method = "rf",
  trControl = cn,
  tuneGrid = grid,
  ntree = 200 #grow 200 trees
)
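Any other randomForest argument can be passed the same way. For example, here is a short sketch (with an illustrative object name fit_imp): importance = TRUE is forwarded to randomForest to compute variable importance, and varImp() is caret's accessor for it:

fit_imp <- train(
  Class ~ .,
  data = training,
  method = "rf",
  trControl = cn,
  tuneGrid = grid,
  importance = TRUE #forwarded to randomForest, computes permutation importance
)
#variable importance of the final model
varImp(fit_imp)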

To find the best set of variables (also known as variable selection or feature selection), caret has a lot of functions that can be helpful. There is an entire section of the package vignette on feature selection, including simple filters, backwards selection, recursive feature elimination, genetic algorithms, simulated annealing, and of course the built-in feature selection methods of many models (like variable importance for randomForest). Feature selection is a huge topic, however; I suggest starting with the methods in the caret package and digging deeper if you don't find what you are looking for.
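As a starting point, here is a minimal sketch of recursive feature elimination with caret's rfe() and its built-in random forest functions (rfFuncs); the object names and subset sizes are just illustrative, and the column indices assume the GermanCredit subset used above (Class in column 10):

#control object for recursive feature elimination: random forests assessed with 10-fold CV
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
#run RFE on the 9 predictors against Class, testing subsets of 2, 4, 6 and 8 variables
rfe_fit <- rfe(training[, -10], training$Class,
               sizes = c(2, 4, 6, 8),
               rfeControl = rfe_ctrl)
rfe_fit
#the variables retained in the best subset
predictors(rfe_fit)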


 