How to find the optimal value for K in K-nearest neighbors using R?
My dataset contains 5851 observations and is split into a training set (3511 observations) and a test set (2340 observations). I now want to train a KNN model with two variables. I want to do 10-fold CV, repeated 5 times, using the ROC metric and the one-standard-error rule, with the variables preprocessed (centered and scaled). The code is shown below.
library(caret)  # provides trainControl(), train() and twoClassSummary()

set.seed(44780)
# 10-fold CV repeated 5 times, ROC-based summary, one-standard-error selection
ctrl_repcvSE <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                             summaryFunction = twoClassSummary, classProbs = TRUE,
                             selectionFunction = "oneSE")
# candidate values of k
tune_grid <- expand.grid(k = 45:75)
mod4 <- train(purchased ~ total_policies + total_contrib,
              data = mhomes_train, method = "knn",
              trControl = ctrl_repcvSE, metric = "ROC",
              tuneGrid = tune_grid, preProcess = c("center", "scale"))
The problem is that I have already tried many different ranges of K (e.g., K = 10:20, 30:40, 50:60, 150:160) as well as different tuning lengths. However, every time the output says that the chosen value for K is the last one in the range; for example, for K = 70:80 the chosen value is K = 80. This suggests I should look further, because if the largest value in the range is chosen, there are probably better values of K above 80. How do I eventually find the best one?
The assignment only specifies: For k-nearest neighbours, explore reasonable values of k using the total_policies and total_contrib variables only.
Welcome to Stack Overflow. Your question isn't easy to answer.
For k-nearest neighbours I use another function, knn3, which is part of the caret library.
I'll give an example using the iris dataset. We compute the accuracy of the model for different values of k and plot those accuracies.
library(data.table)
library(tidyverse)
library(scales)
library(caret)
dt <- as.data.table(iris)
# converting and scaling data ----
# as.numeric() turns the one-column matrix returned by scale() back into a plain vector
dt$Species <- dt$Species %>% as.factor()
dt$Sepal.Length <- dt$Sepal.Length %>% scale() %>% as.numeric()
dt$Sepal.Width <- dt$Sepal.Width %>% scale() %>% as.numeric()
dt$Petal.Length <- dt$Petal.Length %>% scale() %>% as.numeric()
dt$Petal.Width <- dt$Petal.Width %>% scale() %>% as.numeric()
# fixed seed for reproducibility; remove in the real run ----
set.seed(1234567)
# split data into train and test - 3:1 ----
train_index <- createDataPartition(dt$Species, p = 0.75, list = FALSE)
train <- dt[train_index, ]
test <- dt[-train_index, ]
# values to check for k ----
K_VALUES <- 20:1
test_acc <- numeric(0)
train_acc <- numeric(0)
# fit one model per value of k and record train and test accuracy ----
for (x in K_VALUES) {
  model <- knn3(Species ~ ., data = train, k = x)
  pred_test <- predict(model, test, type = "class")
  pred_test_acc <- confusionMatrix(table(pred_test,
                                         test$Species))$overall["Accuracy"]
  test_acc <- c(test_acc, pred_test_acc)
  pred_train <- predict(model, train, type = "class")
  pred_train_acc <- confusionMatrix(table(pred_train,
                                          train$Species))$overall["Accuracy"]
  train_acc <- c(train_acc, pred_train_acc)
}
data <- data.table(x = K_VALUES, train = train_acc, test = test_acc)
# plot a validation curve ----
# long format: one row per (k, type) pair
plot_data <- gather(data, "type", "value", -x)
# ggplot() replaces the deprecated qplot(); the reversed limits put large k on the left
g <- ggplot(plot_data, aes(x = x, y = value, color = type)) +
  geom_path() +
  xlim(max(K_VALUES), min(K_VALUES) - 1)
print(g)
Now find a k with good accuracy on your test data. That's the value you're looking for.
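Reading the value off the plot can also be done in code. The following is a minimal sketch rather than a definitive recipe: it repeats the iris split, centers and scales with statistics taken from the training set only (via caret's preProcess), and picks the k with the highest test accuracy. The names train_set, test_set, results and best_k are illustrative, and breaking ties in favour of the largest k is an assumption of this sketch.

```r
library(caret)

set.seed(1234567)
idx <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# center and scale using statistics from the training set only
pp <- preProcess(train_set, method = c("center", "scale"))
train_set <- predict(pp, train_set)
test_set  <- predict(pp, test_set)

# test accuracy for each candidate k
k_values <- 1:20
test_acc <- sapply(k_values, function(k) {
  fit  <- knn3(Species ~ ., data = train_set, k = k)
  pred <- predict(fit, test_set, type = "class")
  mean(pred == test_set$Species)
})

results <- data.frame(k = k_values, test_acc = test_acc)
# assumption: among the k values tied for the best accuracy, keep the largest one
best_k <- max(results$k[results$test_acc == max(results$test_acc)])
print(results)
print(best_k)
```

With only a few dozen test rows, several values of k usually tie, so the exact winner depends on the seed.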
Disclaimer: this is simplified, but the approach should help you solve your problem.
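For the caret setup in the question, one likely reason the last value keeps being chosen is the one-standard-error rule itself: it picks the simplest model within one standard error of the best, and, as far as I understand caret's KNN model, a larger k counts as simpler. If ROC is flat or still rising inside a narrow window, the largest k in that window is selected. A coarse grid over a wide range shows where the curve levels off. The sketch below is hedged: mhomes_train is not available here, so it uses data simulated with caret's twoClassSim as a stand-in, and the grid bounds are arbitrary.

```r
library(caret)

set.seed(44780)
# simulated two-class data standing in for mhomes_train (an assumption)
sim <- twoClassSim(n = 500)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     summaryFunction = twoClassSummary, classProbs = TRUE,
                     selectionFunction = "oneSE")

# coarse grid over a wide range instead of narrow windows such as 70:80
coarse_grid <- expand.grid(k = seq(5, 205, by = 20))

mod <- train(Class ~ TwoFactor1 + TwoFactor2, data = sim, method = "knn",
             trControl = ctrl, metric = "ROC",
             tuneGrid = coarse_grid, preProcess = c("center", "scale"))

# mean ROC and its standard deviation across resamples for every k
print(mod$results[, c("k", "ROC", "ROCSD")])
plot(mod)            # resampled ROC against k
print(mod$bestTune)  # k selected by the one-standard-error rule
```

Once the curve shows a plateau, a finer grid around that region can be passed to tuneGrid. Refitting with selectionFunction = "best" shows how much of the choice is due to the one-standard-error rule.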