
Distance calculation on matrix using numpy

I am trying to implement a K-means algorithm in Python (I know there are libraries for that, but I want to learn how to implement it myself). Here is the function I am having a problem with:

def AssignPoints(points, centroids):
    """
    Takes two arguments:
    points is a numpy array such that points.shape = m , n where m is number of examples,
    and n is number of dimensions.

    centroids is numpy array such that centroids.shape = k , n where k is number of centroids.
    k < m should hold.

    Returns:
    numpy array A such that A.shape = (m,) and A[i] is index of the centroid which points[i] is assigned to.
    """

    m ,n = points.shape
    temp = []
    for i in xrange(n):
        temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
    distances = np.hypot(*temp)
    return distances.argmin(axis=1)

The purpose of this function: given m points in n-dimensional space and k centroids in n-dimensional space, produce a numpy array (x1 x2 x3 x4 ... xm) where x1 is the index of the centroid closest to the first point. This was working fine until I tried it with 4-dimensional examples. When I pass in 4-dimensional examples, I get this error:

  File "/path/to/the/kmeans.py", line 28, in AssignPoints
    distances = np.hypot(*temp)
ValueError: invalid number of arguments

How can I fix this, or if I can't, how do you suggest I calculate what I am trying to calculate here?

My Answer

def AssignPoints(points, centroids):
    m ,n = points.shape
    temp = []
    for i in xrange(n):
        temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
    for i in xrange(len(temp)):
        temp[i] = temp[i] ** 2
    distances = np.add.reduce(temp) ** 0.5
    return distances.argmin(axis=1)
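
For context, np.hypot is a binary ufunc, so it only ever combines two input arrays; as far as I understand, unpacking four arrays into it is what triggers the "invalid number of arguments" error. Below is a minimal Python 3 sketch of the same fix (square, sum over dimensions, square root). The name assign_points and the use of range instead of xrange are my own assumptions, not part of the original answer.

```python
import numpy as np

def assign_points(points, centroids):
    """Python 3 sketch of the approach above; the function name is hypothetical."""
    n = points.shape[1]
    # One (m, k) array of squared per-dimension differences for each dimension.
    squared = [np.subtract.outer(points[:, i], centroids[:, i]) ** 2
               for i in range(n)]
    # Summing over the list adds up the dimensions, giving an (m, k) array.
    distances = np.sqrt(np.add.reduce(squared))
    return distances.argmin(axis=1)
```

This works for any number of dimensions because the sum runs over however many per-dimension arrays there are.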

Try this:

np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)

Or:

diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)

And don't ask what it's doing :D

Edit: nah, just kidding. The broadcasting in the middle ( points[np.newaxis] - centroids[:,np.newaxis] ) is "making" two 3D arrays from the original ones, with shapes (1, m, n) and (k, 1, n), and subtracting them gives a (k, m, n) array. The result is such that each "plane" contains the difference between all the points and one of the centroids. Let's call it diffs .

Then we do the usual operation to calculate the Euclidean distance (square root of the sum of the squared differences): np.sqrt((diffs ** 2).sum(axis=2)) . We end up with a (k, m) matrix where row 0 contains the distances to centroids[0] , etc. So, the .argmin(axis=0) gives you the result you wanted.
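
To make the shapes concrete, here is a small sketch of those steps on toy data. The data values are made up for illustration (m = 5 points, k = 2 centroids, n = 4 dimensions); only the three array expressions come from the answer.

```python
import numpy as np

# Hypothetical toy data: 5 points and 2 centroids in 4 dimensions.
points = np.array([[0., 0., 0., 0.],
                   [1., 1., 1., 1.],
                   [9., 9., 9., 9.],
                   [10., 10., 10., 10.],
                   [0., 1., 0., 1.]])
centroids = np.array([[0., 0., 0., 0.],
                      [10., 10., 10., 10.]])

# (1, m, n) - (k, 1, n) broadcasts to (k, m, n): one "plane" per centroid.
diffs = points[np.newaxis] - centroids[:, np.newaxis]
# Sum the squared differences over the dimension axis -> (k, m).
dists = np.sqrt((diffs ** 2).sum(axis=2))
# For each point (column), pick the row (centroid) with the smallest distance.
closest = dists.argmin(axis=0)

print(diffs.shape, dists.shape, closest)
```

Note that the intermediate diffs array holds k * m * n values, so for very large inputs the memory cost may matter.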

You need to define a distance function where you are using hypot. Usually in K-means it is Distance = sum((point - centroid)^2). Here is some MATLAB code that does it. I can port it if you can't, but give it a go. Like you said, it's the only way to learn.

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1 
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

[numberOfExamples numberOfDimensions] = size(X);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);


% Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the 
%               range 1..K
%
for loop=1:numberOfExamples
    Distance = sum(bsxfun(@minus,X(loop,:),centroids).^2,2);
    [value index] = min(Distance);
    idx(loop) = index;
end;


end

UPDATE

This should return the distances. Notice that the MATLAB code above only keeps the distance (and index) of the closest centroid, whereas your function computes all distances, as does the one below.

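import numpy as np

def FindDistance(X, centroids):
    # Squared Euclidean distance from every example to every centroid.
    K = np.shape(centroids)[0]
    examples, dimensions = np.shape(X)
    distance = np.zeros((examples, K))
    for ex in range(examples):
        # Broadcast one example against all K centroids, then sum over dimensions.
        distance[ex, :] = np.sum((X[ex, :] - centroids) ** 2, 1)
    return distance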
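
Since the square root is monotonic, the squared distances are enough to pick the closest centroid. A loop-free sketch of the same calculation, followed by the assignment step, might look like this; the name squared_distances and the toy data are assumptions for illustration only.

```python
import numpy as np

def squared_distances(X, centroids):
    # (m, 1, n) - (k, n) broadcasts to (m, k, n); summing axis 2 gives (m, k).
    return ((X[:, np.newaxis, :] - centroids) ** 2).sum(axis=2)

# Hypothetical toy data: 3 examples and 2 centroids in 4 dimensions.
X = np.array([[0., 0., 0., 0.],
              [3., 3., 3., 3.],
              [1., 0., 0., 0.]])
centroids = np.array([[0., 0., 0., 0.],
                      [4., 4., 4., 4.]])

sq = squared_distances(X, centroids)
idx = sq.argmin(axis=1)  # index of the closest centroid for each example
```

Here each row of sq belongs to one example, so the argmin runs along axis 1, and the indices are 0-based rather than 1-based as in the MATLAB version.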
