
Memory Efficient L2 norm using Python broadcasting

I am trying to cluster points in a test dataset based on their similarity to a sample dataset, using Euclidean distance. The test dataset has 500 points, each an N-dimensional vector (N = 1024). The training dataset has around 10000 points, each also a 1024-dimensional vector. The goal is to find the L2 distance between each test point and all the sample points in order to find the closest sample (without using any Python distance functions). Since the test array and training array have different sizes, I tried using broadcasting:

    import numpy as np
    dist = np.sqrt(np.sum( (test[:,np.newaxis] - train)**2, axis=2))

where test is an array of shape (500, 1024) and train is an array of shape (10000, 1024). I am getting a MemoryError. However, the same code works for smaller arrays. For example:

    test = np.array([[1, 2], [3, 4]])
    train = np.array([[1, 0], [0, 1], [1, 1]])
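For context on the MemoryError, here is a rough back-of-the-envelope sketch of how large the broadcast intermediate gets for the full-size arrays. It assumes float64 data (NumPy's default float dtype); the variable names are made up for illustration.

```python
import numpy as np

# Shapes from the question; float64 is an assumption about the data type.
n_test, n_train, dim = 500, 10000, 1024
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per element

# test[:, np.newaxis] - train broadcasts to shape (n_test, n_train, dim),
# and that whole array is materialized before the sum over axis=2.
intermediate_bytes = n_test * n_train * dim * itemsize

# The final distance matrix only has shape (n_test, n_train).
result_bytes = n_test * n_train * itemsize

print(f"broadcast intermediate: {intermediate_bytes / 1e9:.2f} GB")  # 40.96 GB
print(f"distance matrix:        {result_bytes / 1e6:.2f} MB")        # 40.00 MB
```

So the result itself is small; it is the temporary 3-D array that does not fit in memory.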

Is there a more memory-efficient way to do the above computation without loops? Based on posts online, the L2 norm can be implemented with matrix multiplication as sqrt(X * X - 2 * X * Y + Y * Y). So I tried the following:

    x2 = np.dot(test, test.T)
    y2 = np.dot(train,train.T)
    xy = 2* np.dot(test,train.T)

    dist = np.sqrt(x2 - xy + y2)

Since the matrices have different shapes, I get a dimension mismatch when I try to broadcast, and I am not sure what the right way to broadcast is (I don't have much experience with Python broadcasting). I would like to know the right way to implement the L2 distance computation as a matrix multiplication in Python when the matrices have different shapes. The resulting distance matrix should have dist[i, j] = Euclidean distance between test point i and sample point j.

Thanks.
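To see where the dimension mismatch in the attempt above comes from, it may help to print the shapes on small stand-in arrays. This is only an illustrative sketch; the sizes (5 and 7 points, 4 dimensions) are arbitrary assumptions, not the real data.

```python
import numpy as np

# Small stand-ins for the question's arrays (sizes are illustrative assumptions).
rng = np.random.default_rng(0)
test = rng.random((5, 4))    # 5 test points, 4 dimensions
train = rng.random((7, 4))   # 7 training points, 4 dimensions

x2 = np.dot(test, test.T)    # (5, 5): dot products between all pairs of test points
y2 = np.dot(train, train.T)  # (7, 7): dot products between all pairs of training points
xy = np.dot(test, train.T)   # (5, 7): the only term with the desired output shape

print(x2.shape, y2.shape, xy.shape)  # (5, 5) (7, 7) (5, 7)

# Only the diagonals of x2 and y2 hold the squared norms the formula needs.
x_norms = np.diag(x2)        # (5,)
y_norms = np.diag(y2)        # (7,)
```

The (5, 5) and (7, 7) arrays cannot broadcast against the (5, 7) one, which is the mismatch. The formula needs one squared norm per point, i.e. a per-row sum of squares, rather than a full Gram matrix.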

Here is the broadcasting version, with the shapes of the intermediates made explicit:

    m = x.shape[0]  # x has shape (m, d)
    n = y.shape[0]  # y has shape (n, d)
    x2 = np.sum(x**2, axis=1).reshape((m, 1))  # squared norms of x rows, shape (m, 1)
    y2 = np.sum(y**2, axis=1).reshape((1, n))  # squared norms of y rows, shape (1, n)
    xy = x.dot(y.T)  # shape is (m, n)
    dists = np.sqrt(x2 + y2 - 2*xy)  # shape is (m, n)

The documentation on broadcasting has some pretty good examples.
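As a sanity check, the matrix-multiplication formulation can be compared against the direct broadcast on the small arrays from the question, where memory is not an issue. This is a sketch for verification only; the name ref is introduced here for the reference result.

```python
import numpy as np

# The small arrays from the question, as floats.
x = np.array([[1, 2], [3, 4]], dtype=float)          # (m, d) = (2, 2)
y = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)  # (n, d) = (3, 2)

m, n = x.shape[0], y.shape[0]
x2 = np.sum(x**2, axis=1).reshape((m, 1))   # (2, 1)
y2 = np.sum(y**2, axis=1).reshape((1, n))   # (1, 3)
xy = x.dot(y.T)                             # (2, 3)
dists = np.sqrt(x2 + y2 - 2 * xy)           # (2, 1) + (1, 3) - (2, 3) -> (2, 3)

# Reference: the direct broadcast from the question, fine at this size.
ref = np.sqrt(np.sum((x[:, np.newaxis] - y) ** 2, axis=2))

print(np.allclose(dists, ref))  # True
```

Here dists[i, j] is the distance between x[i] and y[j]; for instance dists[0, 0] is the distance between (1, 2) and (1, 0), which is 2.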

A simplified, working version based on this answer:

    x, y = test, train

    x2 = np.sum(x**2, axis=1, keepdims=True)  # shape (m, 1)
    y2 = np.sum(y**2, axis=1)                 # shape (n,), broadcasts as (1, n)
    xy = np.dot(x, y.T)                       # shape (m, n)
    dist = np.sqrt(x2 - 2*xy + y2)            # shape (m, n)

So the approach you have in mind is correct, but you need to be careful about how you apply it.
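One place where care is needed: in floating point, x2 - 2*xy + y2 can come out slightly negative for points that are identical or nearly so, and np.sqrt then returns nan. Below is a minimal sketch that clamps at zero before taking the square root. The function name pairwise_l2 and the random data are assumptions for illustration, not part of the original code.

```python
import numpy as np

def pairwise_l2(x, y):
    """Euclidean distances between rows of x (m, d) and rows of y (n, d)."""
    x2 = np.sum(x**2, axis=1, keepdims=True)   # (m, 1)
    y2 = np.sum(y**2, axis=1)                  # (n,), broadcasts as (1, n)
    sq = x2 - 2 * np.dot(x, y.T) + y2          # (m, n) squared distances
    np.maximum(sq, 0, out=sq)                  # clamp round-off negatives in place
    return np.sqrt(sq)

# Hypothetical data: distances of a point set to itself, so the diagonal
# should be zero up to round-off.
rng = np.random.default_rng(0)
pts = rng.random((50, 1024))
dist = pairwise_l2(pts, pts)

print(np.isnan(dist).any())  # False
```

The largest array this allocates is the (m, n) result, so for the sizes in the question (500 by 10000) it stays in the tens of megabytes.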

To make your life easier, consider using the tested and proven functions from scipy or scikit-learn.

I think what you are asking for already exists in scipy in the form of the cdist function.

    from scipy.spatial.distance import cdist
    res = cdist(test, train, metric='euclidean')
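Since the stated goal is to find the closest sample for each test point, the distance matrix can be followed by an argmin along the sample axis. This is a small sketch on random stand-in data; the array sizes and the names nearest_idx and nearest_dist are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Random stand-ins for the real arrays (sizes are illustrative assumptions).
rng = np.random.default_rng(0)
test = rng.random((5, 8))     # 5 test points, 8 dimensions
train = rng.random((20, 8))   # 20 sample points, 8 dimensions

res = cdist(test, train, metric='euclidean')   # shape (5, 20)

# For each test point, the index of the closest sample and that distance.
nearest_idx = res.argmin(axis=1)                        # shape (5,)
nearest_dist = res[np.arange(len(test)), nearest_idx]   # shape (5,)

print(nearest_idx.shape, nearest_dist.shape)  # (5,) (5,)
```

The same argmin step works on a distance matrix computed with the matrix-multiplication approach, since only the (m, n) layout matters.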

