Calculate weighted pairwise distance matrix in Python
I am trying to find the fastest way to perform the following pairwise distance calculation in Python. I want to use the distances to rank a list_of_objects by their similarity.
Each item in the list_of_objects is characterised by four measurements a, b, c, d, which are made on very different scales, eg:
object_1 = [0.2, 4.5, 198, 0.003]
object_2 = [0.3, 2.0, 999, 0.001]
object_3 = [0.1, 9.2, 321, 0.023]
list_of_objects = [object_1, object_2, object_3]
The aim is to get a pairwise distance matrix of the objects in list_of_objects. However, I want to be able to specify the 'relative importance' of each measurement in my distance calculation via a weights vector with one weight per measurement, eg:
weights = [1, 1, 1, 1]
would indicate that all measurements are equally weighted. In this case I want each measurement to contribute equally to the distance between objects, regardless of the measurement scale. Alternatively:
weights = [1, 1, 1, 10]
would indicate that I want measurement d to contribute 10x more than the other measurements to the distance between objects.
My current algorithm works as follows: for each measurement, take the pairwise absolute differences between objects, normalise them by dividing by the maximum pairwise difference for that measurement, multiply by the appropriate weight from weights, and sum across measurements to produce a ranked list of object pairs from list_of_objects. This works fine, and gives me a weighted version of the city-block distance between objects.
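In NumPy terms, the algorithm just described can be sketched like this (a minimal illustration of the steps above, not the original implementation; the function name is made up):

```python
import numpy as np

def weighted_cityblock(objects, weights):
    """Weighted city-block distances with per-measurement normalisation.

    Sketch of the algorithm described above: normalise each measurement's
    pairwise differences by their maximum, weight, then sum.
    """
    X = np.asarray(objects, dtype=float)
    # pairwise absolute differences, shape (n, n, n_measurements)
    diffs = np.abs(X[:, None, :] - X[None, :, :])
    # divide each measurement by its largest pairwise difference,
    # so all measurements contribute on the same scale
    diffs /= diffs.max(axis=(0, 1))
    # weight and sum across measurements -> (n, n) distance matrix
    return diffs @ np.asarray(weights, dtype=float)

objects = [[0.2, 4.5, 198, 0.003],
           [0.3, 2.0, 999, 0.001],
           [0.1, 9.2, 321, 0.023]]
D = weighted_cityblock(objects, [1, 1, 1, 1])
```

With equal weights, each measurement contributes at most 1.0 to any pairwise distance, regardless of its original scale.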
I have two questions:
Without changing the algorithm, what's the fastest implementation in SciPy, NumPy or SciKit-Learn to perform the initial distance matrix calculations?
Is there an existing multi-dimensional distance approach that does all of this for me?
For Q2, I have looked, but couldn't find anything with a built-in step that does the 'relative importance' in the way that I want.
Other suggestions welcome. Happy to clarify if I've missed details.
scipy.spatial.distance is the module you'll want to have a look at. It has a lot of different distance metrics that can be easily applied.
I'd recommend using the weighted Minkowski metric (wminkowski).
You can do the pairwise distance calculation by using the pdist function from this package. Eg:
import numpy as np
from scipy.spatial.distance import pdist, wminkowski, squareform

object_1 = [0.2, 4.5, 198, 0.003]
object_2 = [0.3, 2.0, 999, 0.001]
object_3 = [0.1, 9.2, 321, 0.023]
list_of_objects = [object_1, object_2, object_3]

# make a 3x4 array from the list of objects
X = np.array(list_of_objects)

# calculate pairwise distances, using the weighted Minkowski norm
distances = pdist(X, wminkowski, 2, [1, 1, 1, 10])

# make a square matrix from the result
distances_as_2d_matrix = squareform(distances)

print(distances)
print(distances_as_2d_matrix)
This will print:
[ 801.00390786 123.0899671 678.0382942 ]
[[ 0. 801.00390786 123.0899671 ]
[ 801.00390786 0. 678.0382942 ]
[ 123.0899671 678.0382942 0. ]]
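Note that wminkowski was deprecated in SciPy 1.6 and removed in SciPy 1.8. On newer SciPy versions the same distances can be obtained with the plain 'minkowski' metric; its w parameter multiplies |u - v|**p rather than the difference itself, so pass the squared weights to match the old behaviour:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.2, 4.5, 198, 0.003],
              [0.3, 2.0, 999, 0.001],
              [0.1, 9.2, 321, 0.023]])

w = np.array([1, 1, 1, 10], dtype=float)
# 'minkowski' computes (sum(w * |u - v|**p))**(1/p), whereas the old
# wminkowski computed (sum(|w * (u - v)|**p))**(1/p); passing w**p
# makes the two agree.
distances = pdist(X, 'minkowski', p=2, w=w ** 2)
distances_as_2d_matrix = squareform(distances)
```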
The normalization step, where you divide pairwise distances by the max value, seems non-standard, and may make it hard to find a ready-made function that will do exactly what you are after. It is pretty easy to do it yourself, though. A starting point is to turn your list_of_objects into an array:
>>> obj_arr = np.array(list_of_objects)
>>> obj_arr.shape
(3, 4)
You can then get the pairwise distances using broadcasting. This is a little inefficient, because it is not taking advantage of the symmetry of your metric, and is calculating every distance twice:
>>> dists = np.abs(obj_arr - obj_arr[:, None])
>>> dists.shape
(3L, 3L, 4L)
Normalizing is very easy to do:
>>> dists /= dists.max(axis=(0, 1))
And your final weighting can be done in a variety of ways; you may want to benchmark which is fastest:
>>> dists.dot([1, 1, 1, 1])
array([[ 0. , 1.93813131, 2.21542674],
[ 1.93813131, 0. , 3.84644195],
[ 2.21542674, 3.84644195, 0. ]])
>>> np.einsum('ijk,k->ij', dists, [1, 1, 1, 1])
array([[ 0. , 1.93813131, 2.21542674],
[ 1.93813131, 0. , 3.84644195],
[ 2.21542674, 3.84644195, 0. ]])
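Since the stated goal is a similarity ranking, the weighted matrix can then be turned into a ranked list of object pairs; a small sketch building on the broadcasting approach above:

```python
import numpy as np
from itertools import combinations

# weighted, normalised distance matrix (as in the broadcasting approach)
obj_arr = np.array([[0.2, 4.5, 198, 0.003],
                    [0.3, 2.0, 999, 0.001],
                    [0.1, 9.2, 321, 0.023]])
dists = np.abs(obj_arr - obj_arr[:, None])
dists /= dists.max(axis=(0, 1))
D = dists.dot([1, 1, 1, 1])

# rank object pairs from most to least similar (smallest distance first)
pairs = list(combinations(range(len(obj_arr)), 2))
ranked = sorted(pairs, key=lambda ij: D[ij])
```

Here ranked holds index pairs into list_of_objects, ordered by increasing weighted distance.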