
K Nearest Neighbor Questions

Hi, I am having trouble understanding the workings of the k-nearest-neighbor algorithm, specifically when trying to implement it in code. I am implementing this in R, but I mainly want to understand the process; I'm not so much worried about the code itself. I will post what I have, my data, and my questions:

Training Data (just a portion of it): 

Feature1 | Feature2  | Class
   2     |     2     |   A
   1     |     4     |   A
   3     |     10    |   B
   12    |     100   |   B
   5     |     5     |   A

So far in my code:

kNN <- function(trainingData, sampleToBeClassified) {

    # file input
    train <- read.table(trainingData, sep = ",", header = TRUE)
    # get the classes (just the class column)
    labels <- as.matrix(train[, ncol(train)])
    # get the features as a matrix (every column but the class column)
    features <- as.matrix(train[, 1:(ncol(train) - 1)])
}

And for this I am calculating the "distance" using this formula:

distance <- function(x1,x2) {
   return(sqrt(sum((x1 - x2) ^ 2)))
}
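To make the distance concrete: it is computed between whole feature vectors (one value per feature), not one number at a time. For example, using the first and third training rows above:

```r
distance <- function(x1, x2) {
    return(sqrt(sum((x1 - x2)^2)))
}

# distance between the feature vectors of rows 1 and 3:
# (2, 2) vs (3, 10)
distance(c(2, 2), c(3, 10))  # sqrt(1^2 + 8^2) = sqrt(65), about 8.06
```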

So is the process for the rest of the algorithm as follows?

1. Loop through every data point (in this case every row of the 2 feature columns), calculate the distance one number at a time, and compare it to the sampleToBeClassified?

2. In the starting case where I want 1-nearest-neighbor classification, would I just store the instance that has the least distance to my sampleToBeClassified?

3. Whatever the closest instance is, find out what class it is; that class then becomes the class of the sampleToBeClassified?

My main question is what role do the features play in this? My instinct is that the two features together are what defines that data item as a certain class, so what should I be calculating the distance between?

Am I on the right track at all? Thanks

It looks as though you're on the right track. The three steps in your process are correct for the 1-nearest-neighbor case. For kNN with k > 1, you just need to make a list of the k nearest neighbors and then determine which class is most prevalent in that list.
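That process can be sketched in R as follows (a minimal sketch, assuming `features` is a numeric matrix with one row per training instance, `labels` holds the class of each row, and `sample` is a numeric vector with one value per feature; the function name and k value are made up for illustration):

```r
kNN_classify <- function(features, labels, sample, k = 1) {
    # Step 1: distance from the sample to every training row
    dists <- apply(features, 1, function(row) sqrt(sum((row - sample)^2)))
    # Step 2: indices of the k nearest rows
    nearest <- order(dists)[1:k]
    # Step 3: most prevalent class among those neighbors
    counts <- table(labels[nearest])
    names(counts)[which.max(counts)]
}

# the sample training data from the question
features <- matrix(c(2, 2,   1, 4,   3, 10,   12, 100,   5, 5),
                   ncol = 2, byrow = TRUE)
labels <- c("A", "A", "B", "B", "A")

kNN_classify(features, labels, c(4, 6), k = 3)  # "A"
```

With k = 3 the nearest rows to (4, 6) are (5, 5), (1, 4), and (3, 10), so the vote is A, A, B and the result is "A".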

As for features, these are just attributes that define each instance and (hopefully) give us an indication as to what class they belong to. For instance, if we're trying to classify animals we could use height and mass as features. So if we have an instance in the class elephant, its height might be 3.27m and its mass might be 5142kg. An instance in the class dog might have a height of 0.59m and a mass of 10.4kg. In classification, if we get something that's 0.8m tall and has a mass of 18.5kg, we know it's more likely to be a dog than an elephant.
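Putting those (hypothetical) numbers through the distance function shows the same intuition:

```r
distance <- function(x1, x2) sqrt(sum((x1 - x2)^2))

elephant <- c(3.27, 5142)   # height (m), mass (kg)
dog      <- c(0.59, 10.4)
unknown  <- c(0.80, 18.5)

distance(unknown, dog)       # about 8.1  -> the unknown is far closer to dog
distance(unknown, elephant)  # about 5123.5
```

Note that the mass term completely dominates here because it's on a much larger scale than height; in practice features are usually normalized before computing distances so that no single feature swamps the others.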

Since we're only using 2 features here we can easily plot them on a graph, with one feature as the X-axis and the other as the Y-axis (it doesn't really matter which one), and the different classes denoted by different colors or symbols. If you plot the sample of your training data above, it's easy to see the separation between Class A and Class B.
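A quick way to make that plot with the sample rows from the question (column names taken from the table above):

```r
train <- data.frame(Feature1 = c(2, 1, 3, 12, 5),
                    Feature2 = c(2, 4, 10, 100, 5),
                    Class    = factor(c("A", "A", "B", "B", "A")))

# one point per training instance, colored by class
plot(train$Feature1, train$Feature2,
     col = train$Class, pch = 19,
     xlab = "Feature1", ylab = "Feature2")
legend("topleft", legend = levels(train$Class),
       col = seq_along(levels(train$Class)), pch = 19)
```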
