Memory-efficient L2 norm using Python broadcasting

Question

I am trying to implement a way to cluster points in a test dataset based on their similarity to a sample dataset, using Euclidean distance. The test dataset has 500 points, and each point is an N-dimensional vector (N = 1024). The training dataset has around 10000 points, and each point is also a 1024-dimensional vector. The goal is to compute the L2 distance between each test point and all the sample points in order to find the closest sample (without using any Python distance functions). Since the test array and training array have different sizes, I tried using broadcasting:

    import numpy as np
    dist = np.sqrt(np.sum( (test[:,np.newaxis] - train)**2, axis=2))

where test is an array of shape (500, 1024) and train is an array of shape (10000, 1024). I am getting a MemoryError. However, the same code works for smaller arrays. For example:

    test = np.array([[1, 2], [3, 4]])
    train = np.array([[1, 0], [0, 1], [1, 1]])

Is there a more memory-efficient way to do the above computation without loops? Based on posts online, the L2 norm can be implemented using matrix multiplication as sqrt(X * X - 2 * X * Y + Y * Y). So I tried the following:

    x2 = np.dot(test, test.T)
    y2 = np.dot(train,train.T)
    xy = 2* np.dot(test,train.T)

    dist = np.sqrt(x2 - xy + y2)

Since the matrices have different shapes, there is a dimension mismatch when I try to broadcast, and I am not sure what the right way to broadcast is (I don't have much experience with Python broadcasting). I would like to know the right way to implement the L2 distance computation as a matrix multiplication in Python when the matrices have different shapes. The resulting distance matrix should have dist[i, j] = Euclidean distance between test point i and sample point j.

Thanks.

Answer 1

Here is broadcasting with the shapes of the intermediates made explicit:

    m = x.shape[0]  # x has shape (m, d)
    n = y.shape[0]  # y has shape (n, d)
    x2 = np.sum(x**2, axis=1).reshape((m, 1))
    y2 = np.sum(y**2, axis=1).reshape((1, n))
    xy = x.dot(y.T)  # shape is (m, n)
    dists = np.sqrt(x2 + y2 - 2*xy)  # shape is (m, n)

The documentation on broadcasting has some pretty good examples.
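To see why this form avoids the MemoryError, it helps to compare the intermediates. The direct broadcast materialises an (m, n, d) array, which for m = 500, n = 10000, d = 1024 in float64 is roughly 41 GB, while the expanded form only needs the (m, n) result, about 40 MB. The sketch below checks the two forms against each other on the small arrays from the question. The function name `pairwise_l2` and the `np.maximum` clamp (a guard against tiny negative values caused by rounding) are my additions, not part of the answer above.

```python
import numpy as np

def pairwise_l2(x, y):
    """Euclidean distances between rows of x (m, d) and rows of y (n, d)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x2 = np.sum(x**2, axis=1).reshape(-1, 1)   # (m, 1)
    y2 = np.sum(y**2, axis=1).reshape(1, -1)   # (1, n)
    xy = x.dot(y.T)                            # (m, n)
    # Rounding can leave tiny negative values; clamp before the square root.
    sq = np.maximum(x2 + y2 - 2 * xy, 0.0)
    return np.sqrt(sq)

test = np.array([[1, 2], [3, 4]])
train = np.array([[1, 0], [0, 1], [1, 1]])

# Direct broadcast from the question, fine for small inputs.
naive = np.sqrt(np.sum((test[:, np.newaxis] - train) ** 2, axis=2))
dists = pairwise_l2(test, train)
same = np.allclose(naive, dists)

# Size in bytes of the intermediates for the shapes in the question (float64).
m, n, d = 500, 10000, 1024
naive_bytes = m * n * d * 8      # (m, n, d) difference array
expanded_bytes = m * n * 8       # (m, n) result only
```

If even the (m, n) result is too large for a given dataset, the same function can be applied to slices of `test` in turn, since each row of the result depends on only one test point.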

Answer 2

Simplified and working version from this answer:

    x, y = test, train

    x2 = np.sum(x**2, axis=1, keepdims=True)
    y2 = np.sum(y**2, axis=1)
    xy = np.dot(x, y.T)
    dist = np.sqrt(x2 - 2*xy + y2)

So the approach you have in mind is correct, but you need to be careful how you apply it.
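Spelled out for a single pair of rows, with x_i a test point and y_j a sample point (the index notation is mine), the identity behind the code is:

```latex
\begin{aligned}
\lVert x_i - y_j \rVert^2
  &= (x_i - y_j)^\top (x_i - y_j) \\
  &= \lVert x_i \rVert^2 - 2\, x_i^\top y_j + \lVert y_j \rVert^2
\end{aligned}
```

Only the middle term is a matrix product, `np.dot(x, y.T)` of shape (m, n). The outer terms are per-row sums of squares with shapes (m, 1) and (n,), which broadcast against it. In the question, `np.dot(test, test.T)` and `np.dot(train, train.T)` instead produce (m, m) and (n, n) matrices of all pairwise dot products, which is why the shapes did not line up.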

To make your life easier, consider using the tested and proven functions from scipy or scikit-learn.

Answer 3

I think what you are asking for already exists in scipy in the form of the cdist function.

    from scipy.spatial.distance import cdist
    res = cdist(test, train, metric='euclidean')
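Since the stated goal is the closest sample for each test point, the distance matrix can then be reduced with `argmin`. This is a minimal sketch, assuming SciPy is installed and using the small arrays from the question in place of the real data; the names `nearest` and `nearest_dist` are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

test = np.array([[1, 2], [3, 4]])
train = np.array([[1, 0], [0, 1], [1, 1]])

res = cdist(test, train, metric='euclidean')   # shape (2, 3)

# Index of the closest training row for each test row, and its distance.
nearest = np.argmin(res, axis=1)
nearest_dist = res[np.arange(res.shape[0]), nearest]
```

For these arrays both test points are closest to the third training row, `[1, 1]`.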

3 answers

Answer 1
13 2016-01-19 14:33:22

Answer 2
3

Answer 3
2 2016-01-19 14:51:30
