Distance calculation on matrix using numpy

Question

I am trying to implement a K-means algorithm in Python (I know there is libraries for that, but I want to learn how to implement it myself.) Here is the function I am havin problem with:

def AssignPoints(points, centroids):
    """
    Takes two arguments:
    points is a numpy array such that points.shape = m , n where m is number of examples,
    and n is number of dimensions.

    centroids is numpy array such that centroids.shape = k , n where k is number of centroids.
    k < m should hold.

    Returns:
    numpy array A such that A.shape = (m,) and A[i] is index of the centroid which points[i] is assigned to.
    """

    m ,n = points.shape
    temp = []
    for i in xrange(n):
        temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
    distances = np.hypot(*temp)
    return distances.argmin(axis=1)

Purpose of this function, given m points in n dimensional space, and k centroids in n dimensional space, produce a numpy array of (x1 x2 x3 x4 ... xm) where x1 is the index of centroid which is closest to first point. This was working fine, until I tried it with 4 dimensional examples. When I try to put 4 dimensional examples, I get this error:

  File "/path/to/the/kmeans.py", line 28, in AssignPoints
    distances = np.hypot(*temp)
ValueError: invalid number of arguments

How can I fix this, or if I can't, how do you suggest I calculate what I am trying to calculate here?

My Answer

def AssignPoints(points, centroids):
    m ,n = points.shape
    temp = []
    for i in xrange(n):
        temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
    for i in xrange(len(temp)):
        temp[i] = temp[i] ** 2
    distances = np.add.reduce(temp) ** 0.5
    return distances.argmin(axis=1)

Answer 1

Try this:

np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)

Or:

diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)

And don't ask what's it doing :D

Edit: nah, just kidding. The broadcasting in the middle ( points[np.newaxis] - centroids[:,np.newaxis] ) is "making" two 3D arrays from the original ones. The result is such that each "plane" contains the difference between all the points and one of the centroids. Let's call it diffs .

Then we do the usual operation to calculate the euclidean distance (square root of the squares of differences): np.sqrt((diffs ** 2).sum(axis=2)) . We end up with a (k, m) matrix where row 0 contain the distances to centroids[0] , etc. So, the .argmin(axis=0) gives you the result you wanted.

Answer 2

You need to define a distance function where you are using hypot. Usually in K-means it is Distance=sum((point-centroid)^2) Here is some matlab code that does it ... I can port it if you can't, but give it a go. Like you said, only way to learn.

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1 
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

[numberOfExamples numberOfDimensions] = size(X);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);


% Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the 
%               range 1..K
%
for loop=1:numberOfExamples
    Distance = sum(bsxfun(@minus,X(loop,:),centroids).^2,2);
    [value index] = min(Distance);
    idx(loop) = index;
end;


end

UPDATE

This should return the distance, notice that the above matlab code just returns the distance(and index) of the closest centroid...your function returns all distances, as does the one below.

def FindDistance(X,centroids):
K=shape(centroids)[0]
examples, dimensions = shape(X)
distance = zeros((examples,K))
for ex in xrange(examples):
    distance[ex,:] = np.sum((X[ex,:]-centroids)**2,1)
return distance

Distance calculation on matrix using numpy

Question

My Answer

2 answers

solution1
4 ACCPTED 2012-01-11 23:22:54

solution2
0 2012-01-11 22:23:22

Distance calculation on matrix using numpy

Question

My Answer

2 answers

solution1 4 ACCPTED 2012-01-11 23:22:54

solution2 0 2012-01-11 22:23:22

solution1
4 ACCPTED 2012-01-11 23:22:54

solution2
0 2012-01-11 22:23:22