Computing matrix distances with numpy

Question

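I am trying to implement the K-means algorithm in Python (I know there are libraries for this, but I want to learn how to implement it myself). Here is the function I am having trouble with: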

def AssignPoints(points, centroids):
    """
    Takes two arguments:
    points is a numpy array such that points.shape = m , n where m is number of examples,
    and n is number of dimensions.

    centroids is numpy array such that centroids.shape = k , n where k is number of centroids.
    k < m should hold.

    Returns:
    numpy array A such that A.shape = (m,) and A[i] is index of the centroid which points[i] is assigned to.
    """

    m ,n = points.shape
    temp = []
    for i in range(n):
        temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
    distances = np.hypot(*temp)
    return distances.argmin(axis=1)

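The purpose of this function is, given m points in n-dimensional space and k centroids in n-dimensional space, to produce a numpy array (x1 x2 x3 x4 ... xm) where x1 is the index of the centroid closest to the first point, and so on. This worked fine until I tried it with 4-dimensional examples. With 4-dimensional examples I get this error: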

  File "/path/to/the/kmeans.py", line 28, in AssignPoints
    distances = np.hypot(*temp)
ValueError: invalid number of arguments

How can I fix this, or if it cannot be fixed, how would you suggest I compute what I am trying to compute here?

My answer

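import numpy as np

def AssignPoints(points, centroids):
    m, n = points.shape
    # One (m, k) matrix of coordinate differences per dimension.
    temp = []
    for i in range(n):
        temp.append(np.subtract.outer(points[:, i], centroids[:, i]))
    # Square each difference matrix, sum over the dimensions, take the root.
    for i in range(len(temp)):
        temp[i] = temp[i] ** 2
    distances = np.add.reduce(temp) ** 0.5
    return distances.argmin(axis=1)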
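
For context on the error itself: np.hypot is a binary ufunc, so it accepts exactly two input arrays (a third positional argument is interpreted as the output array), and unpacking four difference matrices into it raises "invalid number of arguments". As an alternative to summing squares by hand, np.hypot can be folded pairwise over the list, since hypot(hypot(a, b), c) equals sqrt(a² + b² + c²). The following is only a sketch: the function name assign_points_hypot and the sample data are made up for illustration.

```python
from functools import reduce

import numpy as np


def assign_points_hypot(points, centroids):
    """Hypothetical variant of AssignPoints that folds np.hypot pairwise."""
    n = points.shape[1]
    # One (m, k) matrix of coordinate differences per dimension.
    diffs = [np.subtract.outer(points[:, i], centroids[:, i]) for i in range(n)]
    # Starting from 0.0, each step computes hypot(accumulated, next_diff),
    # which accumulates sqrt(d0**2 + d1**2 + ... ) over all dimensions.
    distances = reduce(np.hypot, diffs, 0.0)
    return distances.argmin(axis=1)


# Made-up 4-dimensional example: 3 points, 2 centroids.
points = np.array([[0.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0],
                   [0.9, 1.1, 1.0, 0.8]])
centroids = np.array([[0.0, 0.0, 0.0, 0.1],
                      [1.0, 1.0, 1.0, 0.9]])
labels = assign_points_hypot(points, centroids)
```

Here the first point should land on centroid 0 and the other two on centroid 1.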

Answer 1

Try this:

np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)

Or:

diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)

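Don't ask what it's doing :D

Edit: no, just kidding. The broadcasting in the middle ( points[np.newaxis] - centroids[:,np.newaxis] ) "makes" two 3D arrays out of the original ones. The result is an array in which each "plane" contains the differences between all the points and one of the centroids. Let's call it diffs.

Then we do the usual operation to compute the Euclidean distance (the square root of the sum of the squared differences): np.sqrt((diffs ** 2).sum(axis=2)) . We end up with a (k, m) matrix where row 0 contains the distances to centroids[0], and so on. So .argmin(axis=0) gives you the result you want.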
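
To make the shapes concrete, here is a small sketch of the same steps on made-up data (4 points and 2 centroids in 3 dimensions; the values are illustrative only):

```python
import numpy as np

# Made-up data: m = 4 points and k = 2 centroids in n = 3 dimensions.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [5.0, 5.0, 5.0],
                   [4.0, 5.0, 6.0]])
centroids = np.array([[0.0, 0.0, 0.0],
                      [5.0, 5.0, 5.0]])

# (1, m, n) - (k, 1, n) broadcasts to (k, m, n): one "plane" per centroid.
diffs = points[np.newaxis] - centroids[:, np.newaxis]
assert diffs.shape == (2, 4, 3)

# Summing the squares over the last axis leaves a (k, m) distance matrix.
norm = np.sqrt((diffs ** 2).sum(axis=2))
assert norm.shape == (2, 4)

# For each column (point), pick the row (centroid) with the smallest distance.
closest = norm.argmin(axis=0)
```

With this data the first two points are closest to centroids[0] and the last two to centroids[1].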

Answer 2

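You need to define a distance function where you are currently using hypot. Usually in K-means it is distance = sum((point - centroid)^2). Here is some MATLAB code that does it... I can port it if you can't, but give it a try. Like you said, it's the only way to learn.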

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1 
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

[numberOfExamples numberOfDimensions] = size(X);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);


% Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the 
%               range 1..K
%
for loop=1:numberOfExamples
    Distance = sum(bsxfun(@minus,X(loop,:),centroids).^2,2);
    [value index] = min(Distance);
    idx(loop) = index;
end;


end

UPDATE

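This should return the distances. Note that the MATLAB code above only returns the distance (and index) of the closest centroid... your function returns all of the distances, as does the one below.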

import numpy as np

def FindDistance(X, centroids):
    K = np.shape(centroids)[0]
    examples, dimensions = np.shape(X)
    distance = np.zeros((examples, K))
    for ex in range(examples):
        # Squared Euclidean distance from example ex to every centroid.
        distance[ex, :] = np.sum((X[ex, :] - centroids) ** 2, 1)
    return distance
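
Note that FindDistance returns squared distances. Because the square root is monotonic, this does not change which centroid is nearest, so the root can be skipped when only the assignment is needed. A minimal sketch of this, using broadcasting in place of the loop; the helper name squared_distances and the data are made up for illustration:

```python
import numpy as np


def squared_distances(X, centroids):
    """Hypothetical helper: (m, k) matrix of squared Euclidean distances."""
    # (m, 1, n) - (1, k, n) broadcasts to (m, k, n); sum over the last axis.
    return ((X[:, np.newaxis] - centroids[np.newaxis]) ** 2).sum(axis=2)


# Made-up data: 3 points and 2 centroids in 2 dimensions.
X = np.array([[0.0, 1.0], [2.0, 2.0], [9.0, 8.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

sq = squared_distances(X, centroids)
# Taking the square root does not change which centroid is nearest.
idx_squared = sq.argmin(axis=1)
idx_rooted = np.sqrt(sq).argmin(axis=1)
```

Both idx_squared and idx_rooted give the same assignment for every point.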

Answer 1: score 4, accepted, 2012-01-11 23:22:54

Answer 2: score 0, 2012-01-11 22:23:22
