Calculate weighted pairwise distance matrix in Python
I am trying to find the fastest way to perform the following pairwise distance calculation in Python. I want to use the distances to rank a list_of_objects by their similarity.
Each item in the list_of_objects is characterised by four measurements a, b, c, d, which are made on very different scales, eg:
object_1 = [0.2, 4.5, 198, 0.003]
object_2 = [0.3, 2.0, 999, 0.001]
object_3 = [0.1, 9.2, 321, 0.023]
list_of_objects = [object_1, object_2, object_3]
The aim is to get a pairwise distance matrix of the objects in list_of_objects. However, I want to be able to specify the 'relative importance' of each measurement in my distance calculation via a weights vector with one weight per measurement, eg:
weights = [1, 1, 1, 1]
would indicate that all measurements are equally weighted. In this case I want each measurement to contribute equally to the distance between objects, regardless of the measurement scale. Alternatively:
weights = [1, 1, 1, 10]
would indicate that I want measurement d to contribute 10x more than the other measurements to the distance between objects.
My current algorithm works as follows: for each measurement, take the pairwise absolute differences between objects, normalise them by dividing by the maximum pairwise difference for that measurement, multiply by the appropriate weight from weights, and sum across measurements to produce a ranked list of object pairs from list_of_objects. This works fine, and gives me a weighted version of the city-block distance between objects.
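In NumPy terms, the algorithm just described can be sketched like this (a minimal illustration of the steps above, not the original implementation; the function name is made up):

```python
import numpy as np

def weighted_cityblock(objects, weights):
    """Weighted city-block distances with per-measurement normalisation.

    Sketch of the algorithm described above: normalise each measurement's
    pairwise differences by their maximum, weight, then sum.
    """
    X = np.asarray(objects, dtype=float)
    # pairwise absolute differences, shape (n, n, n_measurements)
    diffs = np.abs(X[:, None, :] - X[None, :, :])
    # divide each measurement by its largest pairwise difference,
    # so all measurements contribute on the same scale
    diffs /= diffs.max(axis=(0, 1))
    # weight and sum across measurements -> (n, n) distance matrix
    return diffs @ np.asarray(weights, dtype=float)

objects = [[0.2, 4.5, 198, 0.003],
           [0.3, 2.0, 999, 0.001],
           [0.1, 9.2, 321, 0.023]]
D = weighted_cityblock(objects, [1, 1, 1, 1])
```

With equal weights, each measurement contributes at most 1.0 to any pairwise distance, regardless of its original scale.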
I have two questions:
Without changing the algorithm, what's the fastest implementation in SciPy, NumPy or SciKit-Learn to perform the initial distance matrix calculations?
Is there an existing multi-dimensional distance approach that does all of this for me?
For Q2, I have looked, but couldn't find anything with a built-in step that does the 'relative importance' in the way that I want.
Other suggestions welcome. Happy to clarify if I've missed details.
scipy.spatial.distance is the module you'll want to have a look at. It has a lot of different distance metrics that can be easily applied.
I'd recommend using the weighted Minkowski metric (wminkowski).
You can do the pairwise distance calculation by using the pdist function from this package. Eg:
import numpy as np
from scipy.spatial.distance import pdist, wminkowski, squareform

object_1 = [0.2, 4.5, 198, 0.003]
object_2 = [0.3, 2.0, 999, 0.001]
object_3 = [0.1, 9.2, 321, 0.023]
list_of_objects = [object_1, object_2, object_3]

# make a 3x4 array from the list of objects
X = np.array(list_of_objects)

# calculate pairwise distances, using the weighted Minkowski norm
distances = pdist(X, wminkowski, 2, [1, 1, 1, 10])

# make a square matrix from the result
distances_as_2d_matrix = squareform(distances)

print(distances)
print(distances_as_2d_matrix)
This will print:
[ 801.00390786 123.0899671 678.0382942 ]
[[ 0. 801.00390786 123.0899671 ]
[ 801.00390786 0. 678.0382942 ]
[ 123.0899671 678.0382942 0. ]]
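Note that wminkowski was deprecated in SciPy 1.6 and removed in SciPy 1.8. On newer SciPy versions the same distances can be obtained with the plain 'minkowski' metric; its w parameter multiplies |u - v|**p rather than the difference itself, so pass the squared weights to match the old behaviour:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.2, 4.5, 198, 0.003],
              [0.3, 2.0, 999, 0.001],
              [0.1, 9.2, 321, 0.023]])

w = np.array([1, 1, 1, 10], dtype=float)
# 'minkowski' computes (sum(w * |u - v|**p))**(1/p), whereas the old
# wminkowski computed (sum(|w * (u - v)|**p))**(1/p); passing w**p
# makes the two agree.
distances = pdist(X, 'minkowski', p=2, w=w ** 2)
distances_as_2d_matrix = squareform(distances)
```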
The normalization step, where you divide pairwise distances by the max value, seems non-standard, and may make it hard to find a ready-made function that will do exactly what you are after. It is pretty easy to do it yourself, though. A starting point is to turn your list_of_objects into an array:
>>> obj_arr = np.array(list_of_objects)
>>> obj_arr.shape
(3, 4)
You can then get the pairwise distances using broadcasting. This is a little inefficient, because it is not taking advantage of the symmetry of your metric, and is calculating every distance twice:
>>> dists = np.abs(obj_arr - obj_arr[:, None])
>>> dists.shape
(3L, 3L, 4L)
Normalizing is very easy to do:
>>> dists /= dists.max(axis=(0, 1))
And your final weighting can be done in a variety of ways; you may want to benchmark which is fastest:
>>> dists.dot([1, 1, 1, 1])
array([[ 0. , 1.93813131, 2.21542674],
[ 1.93813131, 0. , 3.84644195],
[ 2.21542674, 3.84644195, 0. ]])
>>> np.einsum('ijk,k->ij', dists, [1, 1, 1, 1])
array([[ 0. , 1.93813131, 2.21542674],
[ 1.93813131, 0. , 3.84644195],
[ 2.21542674, 3.84644195, 0. ]])
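Since the stated goal is a similarity ranking, the weighted matrix can then be turned into a ranked list of object pairs; a small sketch building on the broadcasting approach above:

```python
import numpy as np
from itertools import combinations

# weighted, normalised distance matrix (as in the broadcasting approach)
obj_arr = np.array([[0.2, 4.5, 198, 0.003],
                    [0.3, 2.0, 999, 0.001],
                    [0.1, 9.2, 321, 0.023]])
dists = np.abs(obj_arr - obj_arr[:, None])
dists /= dists.max(axis=(0, 1))
D = dists.dot([1, 1, 1, 1])

# rank object pairs from most to least similar (smallest distance first)
pairs = list(combinations(range(len(obj_arr)), 2))
ranked = sorted(pairs, key=lambda ij: D[ij])
```

Here ranked holds index pairs into list_of_objects, ordered by increasing weighted distance.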