
Possibly overfitted classification tree but with stable prediction error

I have a question regarding rpart and overfitting. My goal is only to do well on prediction. My dataset is large, almost 20,000 points. Using around 2.5% of these points for training I get a prediction error of around 50%, but using 97.5% of the data for training I get around 30%. Since I am using so much data for training, I suspect there is a risk of overfitting.

I run this 1000 times with random training/test splits, pruning the tree each time (which is a form of cross-validation, if I have understood it correctly), and I get fairly stable results (the same prediction error and variable importances).

Can overfitting still be a problem, even though I have run this 1000 times and the prediction error is stable?

I also have a question regarding correlation between my explanatory variables. Can that be a problem in CART (as it is in regression)? In regression I might use the lasso to deal with the correlation. How can I deal with correlated variables in my classification tree?

When I plot the cptree I get this graph:

[Figure: cptree plot (plotcp output)]

Here is the code I am running (I have repeated this 1000 times, with a different random split each time).

library(rpart)

set.seed(1)  # For reproducibility
train_frac <- 0.975
n <- nrow(beijing_data)

# Split into training and test data
ii <- sample(seq_len(n), floor(n * train_frac))
data_train <- beijing_data[ii,]
data_test <- beijing_data[-ii,]

fit <- rpart(as.factor(PM_Dongsi_levels)~DEWP+HUMI+PRES+TEMP+Iws+
               precipitation+Iprec+wind_dir+tod+pom+weekend+month+
               season+year+day,
             data = data_train, minsplit = 0, cp = 0)

plotcp(fit)

# Pick the CP value with the lowest cross-validated error and prune
cp_fit <- fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
pfit <- prune(fit, cp = cp_fit)
pp <- predict(pfit, newdata = data_test, type = "class")

err <- sum(data_test[,"PM_Dongsi_levels"] != pp)/length(pp)
print(err)

Link to beijing_data (as an RData file so you can reproduce my example): https://www.dropbox.com/s/6t3lcj7f7bqfjnt/beijing_data.RData?dl=0
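For reference, a minimal sketch of the repetition described above, wrapping the same fit/prune/predict steps in a loop (errs is an illustrative name; n and train_frac are taken from the code above):

library(rpart)

form <- as.factor(PM_Dongsi_levels)~DEWP+HUMI+PRES+TEMP+Iws+
  precipitation+Iprec+wind_dir+tod+pom+weekend+month+
  season+year+day

errs <- numeric(1000)
for (i in 1:1000) {
  ii <- sample(seq_len(n), floor(n * train_frac))
  data_train <- beijing_data[ii,]
  data_test <- beijing_data[-ii,]
  fit <- rpart(form, data = data_train, minsplit = 0, cp = 0)
  cp_i <- fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
  pp <- predict(prune(fit, cp = cp_i), newdata = data_test, type = "class")
  errs[i] <- mean(data_test[,"PM_Dongsi_levels"] != pp)  # test error for split i
}
summary(errs)  # distribution of the 1000 test errors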

The question is quite complex and very hard to answer comprehensively. I will try to provide some insights and references for further reading.

  • Correlated features do not pose as severe a problem for tree-based methods as they do for models that use a hyperplane as the classification boundary. When there are multiple correlated features, the tree will just pick one and the rest will be ignored. However, correlated features often cloud the interpretability of such a model, mask interactions, and so on. Tree-based models can also benefit from the removal of such variables, since they then have a smaller space to search. Here is a decent resource on trees. Also check these videos (1, 2 and 3) and the ISLR book. One way to screen for highly correlated predictors is sketched after this list.

  • Models based on a single tree tend not to perform as well as hyperplane-based methods. So if you are mainly interested in prediction quality, you should explore models based on ensembles of trees, such as bagging and boosting. Popular implementations of bagging and boosting in R are randomForest and xgboost. Both can be used with little to no experience and can produce good predictions. Here is a resource on how to use the popular R machine learning library caret to tune a random forest. Another resource is the R mlr library, which provides great wrappers for many ML-related tasks; for instance, here is a short blog post on model-based optimization of xgboost. A minimal bagged-tree baseline is also sketched after this list.

  • The resampling strategy for model validation varies with the task and the available data. With 20k rows I would probably use 50-60% for training, 20% for validation and 20-30% as a test set. The 50-60% training set I would use to select a suitable ML method, features, hyperparameters and so on by repeated K-fold cross-validation (2-3 times repeated 4-5-fold, or similar). The 20% validation set I would use to fine-tune things and to get a feel for how well my cross-validation on the training set generalizes. When I am satisfied with everything, I would use the test set as the final proof that I have a good model. Here are some resources on resampling: 1, 2, 3 and nested resampling.
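As a sketch of the correlation screening mentioned in the first point, caret::findCorrelation can flag numeric predictors whose pairwise correlation exceeds a cutoff (the 0.75 cutoff here is an arbitrary assumption, not a recommendation):

library(caret)

# Correlation matrix of the numeric predictors only; pairwise.complete.obs
# tolerates missing values.
num_vars <- sapply(beijing_data, is.numeric)
cor_mat <- cor(beijing_data[, num_vars], use = "pairwise.complete.obs")

# Columns whose removal reduces pairwise correlations above the cutoff
drop_idx <- findCorrelation(cor_mat, cutoff = 0.75)
colnames(cor_mat)[drop_idx]  # candidate variables to drop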
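And as a minimal bagged-tree baseline for the second point, a sketch using randomForest with default settings (reusing data_train and data_test from the question; na.omit is needed because randomForest, unlike rpart, does not handle missing values):

library(randomForest)

rf_fit <- randomForest(as.factor(PM_Dongsi_levels)~DEWP+HUMI+PRES+TEMP+Iws+
                         precipitation+Iprec+wind_dir+tod+pom+weekend+month+
                         season+year+day,
                       data = data_train, na.action = na.omit)

test_cc <- na.omit(data_test)  # predict needs complete cases
rf_pred <- predict(rf_fit, newdata = test_cc)
mean(test_cc[,"PM_Dongsi_levels"] != rf_pred)  # test error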

In your situation I would use

z <- caret::createDataPartition(data$y, p = 0.6, list = FALSE)
train <- data[z,]
test <- data[-z,]

to split the data into train and test sets. I would then repeat the process to split the test set again with p = 0.5.
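The second split might look like this (a sketch; z2, validation and test_final are illustrative names, and test is the 40% remainder from the first split):

z2 <- caret::createDataPartition(test$y, p = 0.5, list = FALSE)
validation <- test[z2,]
test_final <- test[-z2,]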

On the training data I would use this tutorial on random forests to tune the mtry and ntree parameters (Extend Caret section), using 5-fold repeated cross-validation in caret and a grid search.

control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

tunegrid <- expand.grid(.mtry = c(1:15), .ntree = c(200, 500, 700, 1000, 1200, 1500))

and so on, as detailed in the linked tutorial.
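Putting the pieces together, the tuning call would look roughly like this (a sketch assuming, as in the partition above, that the response column is named y; customRF is the extended caret method defined in the tutorial's Extend Caret section, since the stock method = "rf" tunes only mtry):

library(caret)
set.seed(1)
# customRF comes from the tutorial's Extend Caret section
rf_tuned <- train(y ~ ., data = train,
                  method = customRF, metric = "Accuracy",
                  tuneGrid = tunegrid, trControl = control)
print(rf_tuned)
plot(rf_tuned)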

On a final note, the more data you have to train on, the less likely you are to overfit.
