K Nearest Neighbor Questions

Hi, I am having trouble understanding the workings of the k-nearest neighbor algorithm, specifically when trying to implement it in code. I am implementing this in R, but I mainly want to understand the process; I'm not so worried about the code as about the overall procedure. I will post what I have, my data, and my questions:

Training Data (just a portion of it): 

Feature1 | Feature2  | Class
   2     |     2     |   A
   1     |     4     |   A
   3     |     10    |   B
   12    |     100   |   B
   5     |     5     |   A

So far in my code:

kNN <- function(trainingData, sampleToBeClassified){

    #file input
    train <- read.table(trainingData,sep=",",header=TRUE)
    #get the classes (just the class column)
    labels <- as.matrix(train[,ncol(train)])
    #get the features as a matrix (every column but the class column)
    features <- as.matrix(train[,1:(ncol(train)-1)])
}

And for this I am calculating the "distance" using this formula:

distance <- function(x1,x2) {
   return(sqrt(sum((x1 - x2) ^ 2)))
}
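
(For reference, this helper takes whole feature vectors as its arguments; the sample point below is just made up for illustration.)

# illustrative call: both arguments are full feature vectors
distance(c(2, 2), c(3, 5))   # sqrt((2-3)^2 + (2-5)^2) = sqrt(10), about 3.16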

So, is the process for the rest of the algorithm as follows?

1. Loop through all the data (in this case, every row of the 2 columns), calculating the distance one number at a time and comparing it to the sampleToBeClassified?

2. For the starting case where I want 1-nearest-neighbor classification, would I just store the variable that has the least distance to my sampleToBeClassified?

3. Whatever the closest-distance variable is, find out what class it belongs to, and that class then becomes the class of the sampleToBeClassified?

My main question is: what role do the features play in this? My instinct is that the two features together are what define a data item as a certain class, so what should I be calculating the distance between?

Am I on the right track at all? Thanks

It looks as though you're on the right track. The three steps in your process seem to be correct for the 1-nearest-neighbor case. For kNN, you just need to make a list of the k nearest neighbors and then determine which class is most prevalent in that list.
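
For example, here is a minimal sketch of that procedure in R, building on the code in the question (the function name kNN_classify and the argument k are just illustrative, not from the original post):

kNN_classify <- function(features, labels, sampleToBeClassified, k = 1) {
    # distance from the sample to every training row (all features at once)
    dists <- apply(features, 1, function(row) sqrt(sum((row - sampleToBeClassified)^2)))
    # indices of the k closest training rows
    nearest <- order(dists)[1:k]
    # majority vote among the classes of those neighbors
    votes <- table(labels[nearest])
    names(votes)[which.max(votes)]
}

# e.g. with the training sample above, kNN_classify(features, labels, c(4, 6), k = 3)
# would return "A": the three closest rows are (5,5), (1,4) and (3,10).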

As for features, these are just attributes that define each instance and (hopefully) give us an indication of what class it belongs to. For instance, if we're trying to classify animals we could use height and mass as features. So if we have an instance of the class elephant, its height might be 3.27m and its mass might be 5142kg. An instance of the class dog might have a height of 0.59m and a mass of 10.4kg. In classification, if we get something that's 0.8m tall and has a mass of 18.5kg, we know it's more likely to be a dog than an elephant.
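
Plugging that animal example into the distance() function from the question (the numbers are just the ones from the paragraph above):

elephant <- c(3.27, 5142)     # height (m), mass (kg)
dog      <- c(0.59, 10.4)
unknown  <- c(0.80, 18.5)

distance(unknown, dog)        # about 8.1  -> much closer
distance(unknown, elephant)   # about 5123.5

So the unknown animal gets classified as a dog by its nearest neighbor.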

Since we're only using 2 features here, we can easily plot them on a graph with one feature on the X-axis and the other on the Y-axis (it doesn't really matter which is which), with the different classes denoted by different colors or symbols. If you plot the sample of your training data above, it's easy to see the separation between Class A and Class B.
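
For example, a quick base-R way to make that plot from the sample data above (the colors are arbitrary):

train <- data.frame(Feature1 = c(2, 1, 3, 12, 5),
                    Feature2 = c(2, 4, 10, 100, 5),
                    Class    = c("A", "A", "B", "B", "A"))
plot(train$Feature1, train$Feature2,
     col = ifelse(train$Class == "A", "blue", "red"),
     pch = 19, xlab = "Feature1", ylab = "Feature2")
legend("topleft", legend = c("A", "B"), col = c("blue", "red"), pch = 19)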
