
How to calculate KNN Variable Importance in R

I implemented an authorship attribution project in which I trained a KNN model on articles from two authors. I then classify the author of a new article as either author A or author B. I use the knn() function to generate the model. The output of the model is the table below.

   Word1 Word2 Word3  Author
11    1     48    8      A
2     2     0     0      B
29    1     45    9      A
1     2     0     0      B
4     0     0     0      B
28    3     1     1      B

From the model output, it is clear that Word2 and Word3 are the most significant variables driving the classification between Author A and Author B.

My question is how I can identify this using R.

Basically, your question boils down to having some variables (Word1, Word2, and Word3 in your example) and a binary outcome (Author in your example) and wanting to know how important each variable is in determining that outcome. A natural approach is to train a model that predicts the outcome from the variables and then check the variable importance in that model. I'll include two approaches (logistic regression and random forest) here, but many others could be used.

Let's start with a slightly larger example, in which the outcome depends only on Word2 and Word3, and Word2 has a much larger effect than Word3:

set.seed(144)
dat <- data.frame(Word1=rnorm(10000), Word2=rnorm(10000), Word3=rnorm(10000))
dat$Author <- ifelse(runif(10000) < 1/(1+exp(-10*dat$Word2+dat$Word3)), "A", "B")

We can use the summary of a logistic regression model predicting Author to determine the most important variables:

summary(glm(I(Author=="A")~., data=dat, family="binomial"))
# [snip]
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept)  0.05117    0.04935   1.037    0.300    
# Word1       -0.02123    0.04926  -0.431    0.666    
# Word2        9.52679    0.26895  35.422   <2e-16 ***
# Word3       -0.97022    0.05629 -17.236   <2e-16 ***

From the p-values, we can see that Word2 and Word3 are both highly significant while Word1 is not. From the signs of the coefficients, Word2 has a positive effect and Word3 a negative one. From the magnitudes of the coefficients, Word2 has a much larger effect on the outcome than Word3 (the comparison is valid because, by construction, all the variables are on the same scale).

We can similarly use the variable importance from a random forest predicting the Author outcome:
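The coefficient comparison relies on all variables sharing a scale. When they do not, one common option is to standardize the predictors before fitting, so that each coefficient is expressed per standard deviation. The following is a minimal sketch under assumptions that are not in the original answer: the simulated data are regenerated, Word3 is rescaled by a factor of 100 purely to mimic a change of units, and the names `predictors`, `scaled`, `std_coef`, and `ranking` are illustrative.

```r
# Regenerate the simulated data so the sketch is self-contained.
set.seed(144)
n <- 10000
dat <- data.frame(Word1 = rnorm(n), Word2 = rnorm(n), Word3 = rnorm(n))
dat$Author <- ifelse(runif(n) < 1 / (1 + exp(-10 * dat$Word2 + dat$Word3)), "A", "B")

# Assumption for illustration only: Word3 is recorded in different units.
dat$Word3 <- dat$Word3 * 100

# Standardize each predictor to mean 0 and standard deviation 1.
predictors <- c("Word1", "Word2", "Word3")
scaled <- dat
scaled[predictors] <- lapply(dat[predictors], function(x) (x - mean(x)) / sd(x))

# Fit the same logistic regression on the standardized predictors.
fit <- glm(I(Author == "A") ~ Word1 + Word2 + Word3,
           data = scaled, family = "binomial")

# Rank the predictors by absolute standardized coefficient.
std_coef <- coef(fit)[predictors]
ranking <- names(sort(abs(std_coef), decreasing = TRUE))
print(std_coef)
print(ranking)
```

Without the standardization step, the raw coefficient for the rescaled Word3 would shrink by the same factor of 100, which says nothing about how much the variable matters.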

library(randomForest)
rf <- randomForest(as.factor(Author)~., data=dat)
rf$importance
#       MeanDecreaseGini
# Word1         294.9039
# Word2        4353.2107
# Word3         351.3268

We can identify Word2 as by far the most important variable. This also tells us something else that's interesting: given that we know Word2, Word3 actually isn't much more useful than Word1 in predicting the outcome (and Word1 shouldn't be too useful, because it wasn't used to compute the outcome).
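Both approaches above measure importance in a surrogate model rather than in KNN itself. If importance with respect to the KNN classifier is wanted, one model-agnostic option is permutation importance: shuffle one predictor at a time in held-out data and record how much the accuracy drops. This is a rough sketch rather than part of the original answer; it assumes the `class` package for `knn()`, and the 50/50 split, `k = 15`, the smaller sample size, and the helper names `knn_accuracy` and `perm_importance` are all illustrative choices.

```r
library(class)  # provides knn()

# Smaller simulated data set with the same structure as above.
set.seed(144)
n <- 2000
dat <- data.frame(Word1 = rnorm(n), Word2 = rnorm(n), Word3 = rnorm(n))
dat$Author <- factor(
  ifelse(runif(n) < 1 / (1 + exp(-10 * dat$Word2 + dat$Word3)), "A", "B")
)

# Hold out half of the rows for evaluation.
predictors <- c("Word1", "Word2", "Word3")
train_idx <- sample(n, n / 2)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Accuracy of knn() on a (possibly shuffled) copy of the test predictors.
knn_accuracy <- function(test_x) {
  pred <- knn(train[predictors], test_x, cl = train$Author, k = 15)
  mean(pred == test$Author)
}

baseline <- knn_accuracy(test[predictors])

# Importance of a variable = drop in accuracy after shuffling its column.
perm_importance <- sapply(predictors, function(v) {
  shuffled <- test[predictors]
  shuffled[[v]] <- sample(shuffled[[v]])
  baseline - knn_accuracy(shuffled)
})

print(baseline)
print(sort(perm_importance, decreasing = TRUE))
```

A single shuffle per variable is noisy, so in practice the shuffle would usually be repeated several times and the drops averaged. Because KNN is distance-based, the predictors should also be on comparable scales before this is applied to real data.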
