Can the training set be used to determine variable importance with randomForest in R even though prediction on the test set is quite poor?

Original post: 2019-03-12 19:05:42 | Tags: r, random-forest, training-data

I am using randomForest in R. My model has an R^2 of 0.94 on the training data, but its predictive performance on the test data is quite low. I would like to know whether I can still use this trained model solely to determine which variables are more important/effective for predicting the output.

Thanks

1 Answer

Based on the little information you provide, the question is hard to answer (consider adding more detail and background). Low prediction quality can result from poor tuning of the algorithm, or it can be inherent in the data, i.e. your predictors themselves are not very strongly related to the outcome. In the first case, the prediction could be better with different parameters, e.g. more or fewer trees, different values for mtry, etc. If this is the case, then your importance measures are just as biased as your prediction (and should be used with caution). If the predictors themselves are weak, that means your low-quality prediction is as good as it gets. In this case, I would say the importance measures can be used, but they only tell you which of your overall weak predictors are more or less weak.
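To make the two cases easier to tell apart, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the data frame `dat`, the variables `x1` to `x4`, the helper `r2()`, and the deliberately weak signal are not from the question. It compares R^2 computed on the training rows themselves (which is usually far too optimistic for a random forest) with the out-of-bag and test-set R^2, tries a few `mtry` values, and prints permutation importance.

```r
# Sketch only: simulated data and hypothetical variable names.
library(randomForest)

set.seed(42)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Only x1 and x2 carry signal; the noise is large, so all predictors are weak.
dat$y <- 0.5 * dat$x1 + 0.3 * dat$x2 + rnorm(n, sd = 2)

train_idx <- sample(n, 200)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

rf <- randomForest(y ~ ., data = train, ntree = 500, importance = TRUE)

# Hypothetical helper: R^2 from observed and predicted values.
r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

r2_resub <- r2(train$y, predict(rf, newdata = train))  # same rows used for fitting
r2_oob   <- r2(train$y, predict(rf))                   # out-of-bag predictions
r2_test  <- r2(test$y,  predict(rf, newdata = test))   # held-out rows
print(c(resubstitution = r2_resub, oob = r2_oob, test = r2_test))

# Does tuning help? Compare out-of-bag R^2 across a few mtry values.
for (m in 1:4) {
  fit <- randomForest(y ~ ., data = train, ntree = 500, mtry = m)
  cat("mtry =", m, " OOB R^2 =", round(r2(train$y, predict(fit)), 3), "\n")
}

# Permutation importance (%IncMSE), computed on out-of-bag samples.
imp <- importance(rf, type = 1)
print(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE])
```

If the out-of-bag and test R^2 stay low for every `mtry` you try, that points towards weak predictors rather than poor tuning, and the importance ranking then only orders those weak predictors relative to one another.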