
Creating a Euclidean distance matrix of tensors

I have 10,000 matrices of shape (32, 32, 3). I want to create a Euclidean distance matrix between all of them. In the end, it should look like this:

[0, d2, d3, d4, ...]
[d1, 0, d3, d4, ...]
[d1, d2, 0, d4, ...]
[d1, d2, d3, 0, ...]

How can I do this in the fastest way? I have tried the following, but it takes ages to finish.

import numpy as np
dists = []
for a in range(len(X_test)):
    dists.append([])
    for b in range(len(X_test)):
        dists[a].append(np.linalg.norm(X_test[a] - X_test[b]))
print(dists)

You can cut the time in half by exploiting the fact that the distance matrix is symmetric and computing only the upper triangular portion, using

for b in range(a+1, len(X_test)):

on line 5 of your code.
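As a rough sketch of that idea (the function name `pairwise_upper` and the small random `X_demo` array are placeholders of mine, not part of the question), each distance can be computed once for b > a and mirrored into the lower triangle:

```python
import numpy as np

def pairwise_upper(X):
    """Compute each pairwise distance once (b > a) and mirror it."""
    n = len(X)
    dists = np.zeros((n, n))  # the diagonal stays at zero
    for a in range(n):
        for b in range(a + 1, n):
            d = np.linalg.norm(X[a] - X[b])
            dists[a, b] = d
            dists[b, a] = d  # symmetric entry, no second norm call
    return dists

# Small stand-in for X_test so the sketch runs quickly.
X_demo = np.random.random((20, 32, 32, 3))
D = pairwise_upper(X_demo)
```

This still runs a Python-level double loop, so it only halves the work; it does not change the quadratic growth.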

I don't see any other obvious optimizations that keep the problem exactly the same, but it also seems that you're working with 32x32 images in a three-channel format. That's 3072 dimensions! Why not first down-sample to 4x4, convert to HSL color space, and keep only Hue and Lightness to get a (4,4,2) "signature" for each image? If your problem is mostly about shape, you can throw away Hue too and basically work with black-and-white images.

(4,4,2) has only 32 dimensions, a reduction by a factor of about 100 (3072 / 32 = 96) compared to (32,32,3). And if you do want the full comparison in the (32,32,3) space, you could run it only on images that are already very similar in the (4,4,2) space.
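One possible way to build such a signature is sketched below. It assumes RGB values scaled to [0, 1], down-samples by averaging 8x8 blocks (one choice among several), and uses the standard-library `colorsys.rgb_to_hls` for the color conversion; the function name `signature` is hypothetical.

```python
import colorsys
import numpy as np

def signature(images):
    """Map (n, 32, 32, 3) RGB images in [0, 1] to (n, 4, 4, 2) hue/lightness."""
    n = images.shape[0]
    # Down-sample 32x32 -> 4x4 by averaging each 8x8 block of pixels.
    small = images.reshape(n, 4, 8, 4, 8, 3).mean(axis=(2, 4))
    sig = np.empty((n, 4, 4, 2))
    for idx in np.ndindex(n, 4, 4):
        r, g, b = small[idx]
        # colorsys returns (hue, lightness, saturation), in that order.
        h, l, _s = colorsys.rgb_to_hls(r, g, b)
        sig[idx] = (h, l)
    return sig

images = np.random.random((5, 32, 32, 3))  # stand-in data
sig = signature(images)                    # shape (5, 4, 4, 2)
```

The signatures can then be flattened with `sig.reshape(len(sig), -1)` and compared like any other 32-dimensional vectors. Note that hue is an angle, so a plain Euclidean distance treats hues near 0 and near 1 as far apart even though they are similar colors.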

I have read Divakar's comment.

Rather than asking "Show me, Divakar", I asked myself "What is this pdist/cdist stuff?" I read about pdist and norm and came up with the following code.

Import stuff:

In [1]: import numpy as np
In [2]: from scipy.spatial.distance import pdist

Generate a random sample, not necessarily as large as the OP's, and reshape it as suggested by Divakar:

In [3]: a = np.random.random((100,32,32,3))
In [4]: b = a.reshape((100,32*32*3))

Using the magic of IPython, let's benchmark the two approaches:

In [5]: %%timeit
   ...: dists = []
   ...: for i in range(len(a)):
   ...:     dists.append([])
   ...:     for j in range(len(a)):
   ...:         dists[i].append(np.linalg.norm(a[i] - a[j]))
128 ms ± 337 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit pdist(b)
12.3 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Divakar's approach was one order of magnitude faster, but what about the accuracy? Let's repeat the computations...

In [7]: dists1 = []
   ...: for i in range(len(a)):
   ...:     dists1.append([])
   ...:     for j in range(len(a)):
   ...:         dists1[i].append(np.linalg.norm(a[i] - a[j]))
In [8]: dists2 = pdist(b)

To compare the results, we must be aware that pdist computes only the upper triangle of the square distance matrix (because the matrix is symmetric and the main diagonal is identically zero), so we must be careful when checking our results. Hence I check the off-diagonal part of the first row of dists1 against the first 99 elements of dists2 using allclose:

In [9]: np.allclose(dists1[0][1:], dists2[:99])
Out[9]: True

The result is the same, nice.
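To check every entry rather than just the first row, the condensed output of pdist can be expanded into the full square matrix with `scipy.spatial.distance.squareform`. A minimal sketch, using a smaller random sample than above so it runs quickly (the variable names `loop`, `full` and `same` are mine):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

a = np.random.random((30, 32, 32, 3))
b = a.reshape(len(a), -1)  # one flat 3072-vector per image

# Reference result from the nested-loop approach.
loop = np.array([[np.linalg.norm(x - y) for y in a] for x in a])

# squareform turns the condensed upper triangle into a symmetric
# (30, 30) matrix with zeros on the diagonal.
full = squareform(pdist(b))

same = np.allclose(loop, full)
```

The square form is also the layout the question asks for, at the cost of storing every distance twice.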

What about an estimate of the time required for 10,000 elements? My feeling is that the cost is quadratic, but let's experiment by doubling the number of elements:

In [10]: b = np.random.random((200,32*32*3))
In [11]: %timeit pdist(b)
48 ms ± 97.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The new timing is 4 times the initial one, which is consistent with quadratic scaling. Since 10,000 elements is 100 times the original 100, my estimate for your computation, on my feeble PC and using Divakar's proposal, is 12 ms x 100 x 100 = 120,000 ms = 120 s. You should read olooney's excellent answer carefully and decide what you really want to do.
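The same extrapolation written out as arithmetic, using the two timings measured above. The memory figure is an addition of mine and assumes pdist returns 8-byte floats:

```python
t_100 = 12.3e-3  # seconds, pdist on 100 samples
t_200 = 48e-3    # seconds, pdist on 200 samples

# Doubling n should multiply the time by about 4 if the cost is quadratic.
ratio = t_200 / t_100

n = 10_000
est_seconds = t_100 * (n / 100) ** 2  # scale by (10,000 / 100) squared

# pdist returns one value per unordered pair.
n_pairs = n * (n - 1) // 2
condensed_bytes = n_pairs * 8  # assuming float64 output

print(f"ratio = {ratio:.2f}")
print(f"estimated time = {est_seconds:.0f} s")
print(f"condensed result = {condensed_bytes / 1e9:.2f} GB")
```

So besides roughly two minutes of compute, the condensed result alone takes about 0.4 GB, and the full square matrix twice that.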
