How to find the optimal value for K in K-nearest neighbors using R?

My dataset contains 5851 observations and is split into a training set (3511 observations) and a test set (2340 observations). I now want to train a KNN model with two variables. I want to use 10-fold cross-validation, repeated 5 times, with ROC as the metric and the one-standard-error rule for model selection, and with the variables preprocessed (centered and scaled). The code is shown below.

library(caret)

set.seed(44780)
# 10-fold CV repeated 5 times; select k with the one-standard-error rule
ctrl_repcvSE <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                             summaryFunction = twoClassSummary, classProbs = TRUE,
                             selectionFunction = "oneSE")
# candidate values of k
tune_grid <- expand.grid(k = 45:75)
# mhomes_train is the training set described above
mod4 <- train(purchased ~ total_policies + total_contrib,
              data = mhomes_train, method = "knn",
              trControl = ctrl_repcvSE, metric = "ROC",
              tuneGrid = tune_grid, preProcess = c("center", "scale"))

The problem is that I have already tried many different ranges for K (e.g., K = 10:20, 30:40, 50:60, 150:160) as well as different tuning lengths. However, every time the output says that the chosen value of K is the last one in the range; for example, for K = 70:80 the chosen value is K = 80. This suggests I should look further, because if the chosen value is always the largest K in the grid, there may be better values of K above it. How should I eventually find the best one?

The assignment only specifies: "For k-nearest neighbours, explore reasonable values of k using the total_policies and total_contrib variables only."

Welcome to Stack Overflow. Your question isn't easy to answer.

For k-nearest neighbours I use another function, knn3, which is part of the caret library.

I'll give an example using the iris dataset. We compute the accuracy of our model for different values of k and plot those accuracies.

library(data.table)
library(tidyverse)
library(caret)

dt <- as.data.table(iris)

# converting and scaling data ----
# scale() returns a one-column matrix, so convert back to a plain numeric vector
dt$Species      <- as.factor(dt$Species)
dt$Sepal.Length <- as.numeric(scale(dt$Sepal.Length))
dt$Sepal.Width  <- as.numeric(scale(dt$Sepal.Width))
dt$Petal.Length <- as.numeric(scale(dt$Petal.Length))
dt$Petal.Width  <- as.numeric(scale(dt$Petal.Width))

# remove in the real run ----
set.seed(1234567)

# split data into train and test - 3:1 ----
train_index <- createDataPartition(dt$Species, p = 0.75, list = FALSE)
train <- dt[train_index, ]
test <- dt[-train_index, ]

# values to check for k ----
K_VALUES  <- 20:1
test_acc  <- numeric(0)
train_acc <- numeric(0)

# calculate different models for each value of k ----
for (x in K_VALUES){
  model <- knn3(Species ~ ., data = train, k = x)
  pred_test <- predict(model, test, type = "class")
  pred_test_acc <- confusionMatrix(table(pred_test,
                                         test$Species))$overall["Accuracy"]
  test_acc <- c(test_acc, pred_test_acc)

  pred_train <- predict(model, train, type = "class")
  pred_train_acc <- confusionMatrix(table(pred_train,
                                          train$Species))$overall["Accuracy"]
  train_acc <- c(train_acc, pred_train_acc)
}

data <- data.table(x = K_VALUES, train = train_acc, test = test_acc)

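# plot a validation curve ----
# reshape to long format: one row per (k, train/test) pair
plot_data <- pivot_longer(data, -x, names_to = "type", values_to = "value")
# qplot() is deprecated in recent ggplot2 releases, so build the plot with ggplot();
# the x axis is reversed so that k decreases from left to right
g <- ggplot(plot_data, aes(x = x, y = value, color = type)) +
  geom_line() +
  scale_x_reverse()
print(g)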

Now find a k with good accuracy on your test data. That's the value you're looking for.
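Reading the best k off the plot can also be done programmatically. The following is a minimal, self-contained sketch of that step on iris, not a definitive recipe: the names `train_set`, `test_set`, `acc_table`, and `best_k` are made up for illustration, and breaking ties in favour of the largest k is an assumption (it prefers the smoother model), not something the approach above prescribes.

```r
library(caret)  # provides knn3() and createDataPartition()

set.seed(1)  # assumption: fixed seed only so the sketch is repeatable

# Scale the four numeric predictors of iris; Species stays the class label
dat <- iris
dat[1:4] <- lapply(dat[1:4], function(col) as.numeric(scale(col)))

# 3:1 split into training and test data, stratified by Species
idx       <- createDataPartition(dat$Species, p = 0.75, list = FALSE)
train_set <- dat[idx, ]
test_set  <- dat[-idx, ]

k_values <- 1:20

# Test-set accuracy for each candidate k
test_acc <- sapply(k_values, function(k) {
  model <- knn3(Species ~ ., data = train_set, k = k)
  pred  <- predict(model, test_set, type = "class")
  mean(as.character(pred) == as.character(test_set$Species))
})

acc_table <- data.frame(k = k_values, test_acc = test_acc)

# Highest test accuracy; ties go to the largest k (assumed tie-break)
best_acc <- max(acc_table$test_acc)
best_k   <- max(acc_table$k[acc_table$test_acc == best_acc])

print(acc_table)
cat("best k:", best_k, "with test accuracy", best_acc, "\n")
```

If the best accuracy sits at the edge of `k_values`, that is a hint to widen the range and run the loop again.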

Disclosure: This is simplified, but the approach should help you solve your problem.
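The same idea can be carried over to the caret::train setup from the question. One possible way to avoid searching narrow windows one at a time is to start with a coarse grid that spans a wide range of k, and to look at the resampled metric for every k instead of only the selected one. The sketch below is hedged: it uses iris and the Accuracy metric as stand-ins (iris has three classes, so twoClassSummary and ROC do not apply), and the grid `coarse_grid` is an illustrative assumption, not a recommended range.

```r
library(caret)

set.seed(1)  # assumption: fixed seed only so the sketch is repeatable

# 10-fold CV repeated 5 times, as in the question
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Coarse, wide grid of k values (illustrative choice)
coarse_grid <- expand.grid(k = seq(1, 61, by = 6))

fit <- train(Species ~ ., data = iris, method = "knn",
             trControl = ctrl, metric = "Accuracy",
             tuneGrid = coarse_grid,
             preProcess = c("center", "scale"))

# Resampled accuracy and its standard deviation for every k in the grid
print(fit$results[, c("k", "Accuracy", "AccuracySD")])

# The k selected by caret's default rule (best resampled Accuracy)
print(fit$bestTune)
```

Once the coarse grid shows where the metric levels off or starts to drop, a finer grid around that region can be tried; `plot(fit)` draws the same results as a curve.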
