
Linear discriminant analysis variable importance

Using the R MASS package to do a linear discriminant analysis, is there a way to get a measure of variable importance?

library(MASS)
### import data and do some preprocessing
fit <- lda(cat~., data=train)

I have a data set with about 20 measurements used to predict a binary category. But the measurements are hard to obtain, so I want to reduce them to the most influential ones.

When using rpart or randomForest I can get a list of variable importances, or a Gini decrease statistic, using summary() or importance().

Is there a built-in function to do this that I can't find? Or if I have to code one, what would be a good way to go about it?

I would recommend using the "caret" package.

library(caret)
data(mdrr)

# Drop near-zero-variance predictors, then one of each pair of
# predictors with pairwise correlation above 0.8
mdrrDescr <- mdrrDescr[, -nearZeroVar(mdrrDescr)]
mdrrDescr <- mdrrDescr[, -findCorrelation(cor(mdrrDescr), .8)]

# 75/25 stratified train/test split
set.seed(1)
inTrain <- createDataPartition(mdrrClass, p = .75, list = FALSE)[,1]
train <- mdrrDescr[ inTrain, ]
test  <- mdrrDescr[-inTrain, ]
trainClass <- mdrrClass[ inTrain]
testClass  <- mdrrClass[-inTrain]

# Recursive feature elimination with an LDA model, scored by
# cross-validation over the candidate subset sizes
set.seed(2)
ldaProfile <- rfe(train, trainClass,
                  sizes = c(1:10, 15, 30),
                  rfeControl = rfeControl(functions = ldaFuncs, method = "cv"))

# Performance of the final model on the held-out test set
postResample(predict(ldaProfile, test), testClass)

Once the variable "ldaProfile" is created, you can retrieve the best subset of variables:

ldaProfile$optVariables
[1] "X5v"    "VRA1"   "D.Dr06" "Wap"    "G1"     "Jhetm"  "QXXm"  
[8] "nAB"    "H3D"    "nR06"   "TI2"    "nBnz"   "Xt"     "VEA1"  
[15] "TIE"

You can also get a nice plot of the number of variables used vs. accuracy.

One option would be to employ permutation importance.

Fit the LDA model, then randomly permute each feature column one at a time and compare the resulting prediction score with the baseline (non-permuted) score.

The more the permuted score drops relative to the baseline score, the more important that feature is. Then you can select a cutoff and keep only those features for which the drop (baseline score minus permuted score) is above the given threshold.

There is a nice tutorial on Kaggle for this topic. It uses Python instead of R, but the concept is directly applicable here.

https://www.kaggle.com/dansbecker/permutation-importance
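Since the tutorial is in Python, here is a minimal self-contained sketch of the same idea in that language: a toy two-class LDA fit by hand from class means and a pooled covariance, then each feature column permuted in turn to measure the accuracy drop. The synthetic data set (one informative feature, one pure-noise feature) and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for this demo): feature 0 carries the
# class signal, feature 1 is pure noise.
n = 500
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 2))
X[:, 0] += 2.5 * y

# Minimal two-class LDA fit by hand: class means, pooled covariance,
# discriminant direction w, and midpoint threshold b.
mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
centered = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])
sigma = centered.T @ centered / (n - 2)
w = np.linalg.solve(sigma, mu1 - mu0)
b = w @ (mu0 + mu1) / 2

def accuracy(features):
    return np.mean((features @ w > b) == y)

baseline = accuracy(X)

# Permutation importance: shuffle one column at a time and record
# the drop in accuracy relative to the baseline score.
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance.append(baseline - accuracy(Xp))

print(baseline, importance)
```

Permuting the informative column should cut the accuracy roughly to chance, while permuting the noise column should leave it nearly unchanged, so the importance scores separate the two features cleanly.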
