My own K-means algorithm in R

I am a beginner at R programming and I am doing this exercise as an introduction to programming. I have written my own K-means implementation in R, but I have been stuck for a while at one point: I need the algorithm to converge, that is, to iterate until it finds the optimal center of each cluster.

This is the raw algorithm without iteration. It just takes k random data points from the whole data set as centers.

# data: numeric matrix with columns x and y; k: number of clusters (both defined beforehand)
Centroid_test = data[sample(nrow(data), k), ]   # pick k random rows as initial centers
# m[i, j] = Euclidean distance from center i to data point j (k x n matrix)
m = apply(data, 1, function(pt) apply(Centroid_test, 1, function(ctr) dist(rbind(ctr, pt))))
colnames(m) = rownames(data)
minByCol <- apply(m, MARGIN=2, FUN=which.min)   # index of the nearest center for each point
minByColdf = as.data.frame(minByCol)
MasterDataframe = data.frame(data, minByColdf)
Sort_Master = MasterDataframe[order(MasterDataframe[,3]), ]   # sort the points by cluster
res = data.frame(Sort_Master)
cen = Centroid_test
rownames(cen) = 1:k
res
cen

So, I have some cluster centers and the data points assigned to each cluster, but these are not the optimal centers. How can I find good centers?

My attempt is below. I know that I have to iterate the above code, let's say up to kmax times, until a condition is met that stops the iteration and thus gives the clusters that best fit the data:

for (n in 1:kmax){

  if (condition)
    break;
}

But how do I define the condition? After reading a bit about k-means, one idea was to find a center whose value is closest to the mean of its group. I wrote this bit of code:

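kn = 1                              # cluster to inspect
group = subset(res, res[,3] == kn)  # points assigned to cluster kn
mean(group$x)                       # mean of the group ...
mean(group$y)
cen[kn, "x"]                        # ... versus its current center
cen[kn, "y"]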

But I do not know how to express "the center most similar to the mean" in code. Another idea I found was to find the centers that have the minimum distance to their points. I could not work out how to write this in code either.
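One way to express "the center is close enough to the mean of its group" is to measure how far each center moves when it is replaced by the mean of its assigned points, and to stop once the largest move falls below a small tolerance. The following is only a minimal sketch of that idea, assuming Euclidean distance and a numeric matrix as input. The names lloyd_sketch and tol are hypothetical, kmax is the iteration limit from the loop above, and a cluster that loses all its points simply keeps its previous center.

```r
# Sketch of a Lloyd-style loop with a stopping condition (names are illustrative).
lloyd_sketch <- function(data, k, kmax = 100, tol = 1e-6) {
  data <- as.matrix(data)
  # Start from k distinct random data points
  centers <- data[sample(nrow(data), k), , drop = FALSE]
  for (n in 1:kmax) {
    # Squared distance from every point (rows) to every center (columns)
    d2 <- sapply(1:k, function(j) rowSums(sweep(data, 2, centers[j, ])^2))
    cluster <- apply(d2, 1, which.min)
    # New center of each cluster = column means of its points
    new_centers <- centers
    for (j in 1:k) {
      members <- data[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) new_centers[j, ] <- colMeans(members)
    }
    # The "condition": the largest distance any center moved in this step
    shift <- max(sqrt(rowSums((new_centers - centers)^2)))
    centers <- new_centers
    if (shift < tol) break
  }
  list(centers = centers, cluster = cluster, iterations = n)
}
```

With the dummy data, something like fit <- lloyd_sketch(data, k = 3) followed by points(fit$centers, pch = 21, bg = 2) would draw the final centers on top of plot(data).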

If you could show me how or share an idea, that would be very helpful!

Thanks a lot in advance!

EDIT:

To clarify:

What I want is an algorithm that finds the optimal cluster centers with regard to the distance between each center and the points of its cluster. After reading more about k-means, I found that there are the Forgy/Lloyd algorithm, the MacQueen algorithm and the Hartigan & Wong algorithm. Each one tries to find the optimal centers with a different approach.
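For reference, the built-in kmeans function in the stats package exposes these variants through its algorithm argument, which makes it a convenient baseline to compare a hand-written version against. The sketch below is only an illustration: the toy data, the seed, and the choice of 3 centers are assumptions, and "Forgy" is left out because the R documentation describes it as another name for "Lloyd".

```r
# Compare the variants offered by the built-in stats::kmeans on toy data.
set.seed(42)  # arbitrary seed, only to make the sketch repeatable
toy <- cbind(x = rnorm(500, 1.15), y = rnorm(500, 1.65))
algorithms <- c("Hartigan-Wong", "Lloyd", "MacQueen")
fits <- lapply(algorithms, function(a) {
  kmeans(toy, centers = 3, iter.max = 100, algorithm = a)
})
names(fits) <- algorithms
# Total within-cluster sum of squares: smaller means tighter clusters
sapply(fits, function(f) f$tot.withinss)
```

Each fitted object also carries the final centers in $centers and the assignment of every point in $cluster.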

The above code picks random points as centers, then calculates how far each point is from each center, and assigns each point to the cluster of the center it is closest to. cen contains the center of each cluster, and res lists all the points together with the cluster each one is assigned to (that is what the third column is for).

My idea was first to calculate, after the points have been grouped into clusters, the distance from each point to the center of its cluster, and save it in a data frame or similar. The next step would be to do it all again: pick new random centers, assign the points to them, form the clusters, and again calculate and save the distances between points and centers. In the end there would be a data frame or matrix holding the distances from many runs (for example, 100 iterations), and we could pick the centers that gave the smallest distance between each point and its cluster center. Those centers, the ones with the minimal distance to their points, would be the optimal cluster centers.
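The idea in this paragraph could be sketched roughly as follows. Note that this is a plain random search: it only ever picks the best of the randomly drawn centers and never moves a center to the mean of its group, so it is not one of the named algorithms above. The function names total_distance and best_random_centers, and the parameter n_tries, are hypothetical.

```r
# Random-restart search over candidate centers (illustrative names throughout).
total_distance <- function(data, centers) {
  # Distance from every point (rows) to every center (columns)
  d <- sapply(1:nrow(centers), function(j) {
    sqrt(rowSums(sweep(data, 2, centers[j, ])^2))
  })
  # Each point contributes the distance to its nearest center
  sum(apply(d, 1, min))
}

best_random_centers <- function(data, k, n_tries = 100) {
  data <- as.matrix(data)
  best <- NULL
  best_score <- Inf
  for (i in 1:n_tries) {
    # Draw k random data points as candidate centers
    candidate <- data[sample(nrow(data), k), , drop = FALSE]
    score <- total_distance(data, candidate)
    # Keep the candidate set with the smallest total distance so far
    if (score < best_score) {
      best_score <- score
      best <- candidate
    }
  }
  list(centers = best, total_distance = best_score)
}
```

The centers returned this way are always actual data points, so they might serve better as a starting point for an iterative update than as a final answer.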

Dummy data:

y=rnorm(500,1.65)
x=rnorm(500,1.15)

data=cbind(x,y)

After running the above code, plot the data to see the cluster centers:

plot(data)
points(cen, pch=21,bg=23)

The function for calculating the Euclidean distance:

euclid <- function(points1, points2) {
  distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
  for(i in 1:nrow(points2)) {
    distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
  }
  distanceMatrix
}

The K means algorithm that uses the Euclidean distance above:

K_means <- function(x, centers, distFun, nItter) {
  clusterHistory <- vector(nItter, mode="list")
  centerHistory <- vector(nItter, mode="list")

  for(i in 1:nItter) {
    distsToCenters <- distFun(x, centers)
    clusters <- apply(distsToCenters, 1, which.min)
    centers <- apply(x, 2, tapply, clusters, mean)
    # Saving history
    clusterHistory[[i]] <- clusters
    centerHistory[[i]] <- centers
  }

  list(clusters=clusterHistory, centers=centerHistory)
}

Prepare the data:

test=data # The data from above (a matrix or a data.frame)
ktest=as.matrix(test) # Turn into a matrix
centers <- ktest[sample(nrow(ktest), 5),] # Sample some centers, 5 for example

Results

res <- K_means(ktest, centers, euclid, 10)
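The stored history can also be used to check when the centers stopped moving, which is one way to judge whether the chosen number of iterations was enough. The helper below is a hypothetical sketch, not part of the code above: first_stable_iteration and tol are made-up names, and it assumes every entry of the history is a center matrix of the same dimensions (which would not hold if a cluster became empty during the run).

```r
# Find the first iteration at which the centers stopped moving.
# `centerHistory` is a list of k x p center matrices, one per iteration,
# shaped like res$centers from K_means.
first_stable_iteration <- function(centerHistory, tol = 1e-8) {
  if (length(centerHistory) < 2) return(NA_integer_)
  for (i in 2:length(centerHistory)) {
    # Largest change of any coordinate between consecutive iterations
    shift <- max(abs(centerHistory[[i]] - centerHistory[[i - 1]]))
    if (shift < tol) return(i)
  }
  NA_integer_  # never settled within the recorded iterations
}

# Hand-made history, only to show the expected input shape
history <- list(
  matrix(c(0, 0, 4, 4), nrow = 2, byrow = TRUE),
  matrix(c(0.5, 0.5, 5, 5), nrow = 2, byrow = TRUE),
  matrix(c(0.5, 0.5, 5, 5), nrow = 2, byrow = TRUE)
)
first_stable_iteration(history)  # returns 3
```

Applied to the result above, the call would be first_stable_iteration(res$centers); an NA would suggest increasing nItter.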
