Is there a function that helps to predict missing values using k-NN in R?
How to implement the k-NN algorithm in R without using the knn function?
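Is there a link I can refer to? I am new to programming and do not know how to implement this without the knn function. The only example code I have found uses the knn function.
I have put together an example that shows how the algorithm works and how to implement it in R. Here is the data for my example.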
students_known <- data.frame(
    Major = c(rep("Arts", 4), rep("Applied Science", 3),
              rep("Education", 3), rep("Science", 6)),
    Tuition = c(2000, 2200, 2100, 1900, 2800, 3000, 2900,
                2500, 2700, 2600, 3100, 3200, 3150, 3000,
                3175, 3300),
    GPA = c(3.55, 3.40, 3.30, 3.50, 2.90, 3.05, 2.50,
            3.80, 3.45, 3.35, 3.00, 3.50, 4.00, 3.40,
            3.45, 3.30),
    Age = c(20, 18, 22, 21, 24, 23, 21, 19, 19, 21, 20, 18, 17, 21, 24, 23)
)
students_known
#              Major Tuition  GPA Age
# 1             Arts    2000 3.55  20
# 2             Arts    2200 3.40  18
# 3             Arts    2100 3.30  22
# 4             Arts    1900 3.50  21
# 5  Applied Science    2800 2.90  24
# 6  Applied Science    3000 3.05  23
# 7  Applied Science    2900 2.50  21
# 8        Education    2500 3.80  19
# 9        Education    2700 3.45  19
# 10       Education    2600 3.35  21
# 11         Science    3100 3.00  20
# 12         Science    3200 3.50  18
# 13         Science    3150 4.00  17
# 14         Science    3000 3.40  21
# 15         Science    3175 3.45  24
# 16         Science    3300 3.30  23
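Suppose the Major in rows 3 and 11 is unknown and we want to use the knn algorithm to impute it. In this case the class is Major, and the variables used to compute distances are Tuition, GPA, and Age (all numeric).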
students_unknown <- students_known
students_unknown[3,1] <- NA
students_unknown[11,1] <- NA
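The rows with a missing Major can be separated from the rows with a known Major using complete.cases(). A minimal sketch on a small hypothetical data frame (the name toy and its values are made up for illustration):

```r
# Hypothetical toy data: two rows have a missing Major
toy <- data.frame(
    Major = c("Arts", NA, "Science", NA),
    Tuition = c(2000, 2100, 3000, 3100),
    stringsAsFactors = FALSE
)

# TRUE for rows with no NA in any column
is_known <- complete.cases(toy)

known <- toy[is_known, ]     # rows usable as neighbors
unknown <- toy[!is_known, ]  # rows whose Major has to be imputed
```

Note that complete.cases() checks every column, so a row with an NA in a numeric column would also be treated as incomplete.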
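I have implemented a knn algorithm in the function below. The steps of the algorithm are:
1. Weight the columns (optional). In this example, if the columns are not weighted, Tuition has a much greater effect on the distance than GPA and Age do.
2. For each row with a missing value (Major), compute the total squared distance to every complete row (known Major).
3. Select the k rows with the smallest squared distances, then choose the Major that makes up the largest proportion of those k rows.
Using this procedure, the missing values in the Major column can be filled in.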
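The distance and voting steps can be traced by hand for a single row. The sketch below is only an illustration: it uses a hypothetical subset of five rows from the example data (rows 1, 2, 4, 8 and 14) as the known rows, and row 3 as the row to impute. The names known, target and sq_diff are introduced here and are not part of the function that follows.

```r
# Hypothetical subset: five rows with a known Major, taken from the example data
known <- data.frame(
    Major   = c("Arts", "Arts", "Arts", "Education", "Science"),
    Tuition = c(2000, 2200, 1900, 2500, 3000),
    GPA     = c(3.55, 3.40, 3.50, 3.80, 3.40),
    Age     = c(20, 18, 21, 19, 21),
    stringsAsFactors = FALSE
)
# Row whose Major is treated as missing (row 3 of the example data)
target <- c(Tuition = 2100, GPA = 3.30, Age = 22)
train_cols <- c("Tuition", "GPA", "Age")

# Squared difference per column between each known row and the target
sq_diff <- sweep(as.matrix(known[, train_cols]), 2, target[train_cols])^2

# Unweighted, Tuition dominates: 10000 versus 0.0625 (GPA) and 4 (Age)
print(sq_diff[1, ])

# Total squared distance to every known row
d <- rowSums(sq_diff)

# Indices of the k nearest neighbors
k <- 3
knn_index <- head(order(d), k)

# Majority vote among the neighbors
votes <- sort(table(known$Major[knn_index]), decreasing = TRUE)
predicted <- names(votes)[1]
print(predicted)  # "Arts"
```

In this subset the three nearest rows are all Arts students, so the vote is unanimous.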
# train_cols should be numeric; imp_col should represent class
knn <- function(data, imp_col, train_cols, k = 5,
                weight = FALSE) {
    if (weight) {
        # rescale each training column so that all column means are equal
        col_means <- sapply(data[, train_cols], mean)
        col_weights <- max(col_means) / col_means
        for (i in seq_along(train_cols))
            data[, train_cols[i]] <- data[, train_cols[i]] * col_weights[i]
    }
    data_complete <- data[complete.cases(data), ]
    data_incomplete <- data[!complete.cases(data), ]
    ncomplete <- nrow(data_complete)
    nincomplete <- nrow(data_incomplete)
    # seq_len() skips the loop when there are no incomplete rows
    for (j in seq_len(nincomplete)) {
        # total squared distance from incomplete row j to every complete row
        d <- numeric(ncomplete)
        for (i in train_cols)
            d <- d + (data_complete[, i] - data_incomplete[j, i])^2
        # indices of k-nearest neighbors
        knn_index <- head(order(d), k)
        # most frequent class among the k neighbors
        nn <- sort(table(data_complete[knn_index, imp_col]), decreasing = TRUE)
        data_incomplete[j, imp_col] <- names(nn)[1]
    }
    # only the imputed rows are returned (rescaled if weight = TRUE)
    data_incomplete
}
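To see what the weight=TRUE branch does, the weighting step can be run in isolation. This is a sketch on a hypothetical four-row data frame (the name toy and the choice of rows are for illustration only). Each column is multiplied by max(col_means)/col_means, so afterwards every column has the same mean as the largest one.

```r
# Hypothetical toy data: four rows taken from the example
toy <- data.frame(
    Tuition = c(2000, 2800, 2500, 3100),
    GPA     = c(3.55, 2.90, 3.80, 3.00),
    Age     = c(20, 24, 19, 20)
)

# Same weighting rule as in the function above
col_means <- sapply(toy, mean)
col_weights <- max(col_means) / col_means

# Multiply every column by its weight
weighted <- sweep(as.matrix(toy), 2, col_weights, "*")

print(col_weights)        # Tuition gets weight 1; GPA and Age are scaled up
print(colMeans(weighted)) # all three means are now equal
```

This rule equalizes the column means, not the spread of the columns, which is a different choice from standardizing each column with scale().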
Here is the algorithm in use.
data_actual <- students_known[c(3,11),]
data_imputed1 <- knn(students_unknown, imp_col = 1, train_cols = 2:4, k = 3,
                     weight = FALSE)
data_imputed2 <- knn(students_unknown, imp_col = 1, train_cols = 2:4, k = 3,
                     weight = TRUE)
print(as.character(data_actual$Major))
# [1] "Arts" "Science"
print(as.character(data_imputed1$Major)) # imputes properly
# [1] "Arts" "Science"
print(as.character(data_imputed2$Major)) # doesn't impute properly
# [1] "Arts" "Applied Science"