Distance calculation between rows in Pandas Dataframe using a distance matrix

I have the following Pandas DataFrame:

In [31]:
import pandas as pd
sample = pd.DataFrame({'Sym1': ['a','a','a','d'],'Sym2':['a','c','b','b'],'Sym3':['a','c','b','d'],'Sym4':['b','b','b','a']},index=['Item1','Item2','Item3','Item4'])
In [32]: print(sample)
Out [32]:
      Sym1 Sym2 Sym3 Sym4
Item1    a    a    a    b
Item2    a    c    c    b
Item3    a    b    b    b
Item4    d    b    d    a

and I want to find an elegant way to get the distance between each pair of Items according to this distance matrix:

In [34]:
DistMatrix = pd.DataFrame({'a': [0,0,0.67,1.34],'b':[0,0,0,0.67],'c':[0.67,0,0,0],'d':[1.34,0.67,0,0]},index=['a','b','c','d'])
print(DistMatrix)
Out[34]:
      a     b     c     d
a  0.00  0.00  0.67  1.34
b  0.00  0.00  0.00  0.67
c  0.67  0.00  0.00  0.00
d  1.34  0.67  0.00  0.00 

For example, comparing Item1 to Item2 would compare aaab -> accb; using the distance matrix this would be 0+0.67+0.67+0=1.34
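The worked example above can be reproduced directly. This is only a minimal sketch using the sample and DistMatrix frames from the question; the helper name item_distance is made up for illustration.

```python
import pandas as pd

sample = pd.DataFrame({'Sym1': ['a', 'a', 'a', 'd'], 'Sym2': ['a', 'c', 'b', 'b'],
                       'Sym3': ['a', 'c', 'b', 'd'], 'Sym4': ['b', 'b', 'b', 'a']},
                      index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame({'a': [0, 0, 0.67, 1.34], 'b': [0, 0, 0, 0.67],
                           'c': [0.67, 0, 0, 0], 'd': [1.34, 0.67, 0, 0]},
                          index=['a', 'b', 'c', 'd'])

def item_distance(item_a, item_b):
    """Hypothetical helper: sum the per-position symbol distances of two rows."""
    pairs = zip(sample.loc[item_a], sample.loc[item_b])
    return sum(DistMatrix.loc[x, y] for x, y in pairs)

d12 = item_distance('Item1', 'Item2')  # aaab vs accb
print(round(d12, 2))  # 1.34
```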

Ideal output:

       Item1  Item2  Item3  Item4
Item1      0   1.34      0   2.68
Item2   1.34      0      0   1.34
Item3      0      0      0   2.01
Item4   2.68   1.34   2.01      0

This is an old question, but there is a SciPy function that does this:

from scipy.spatial.distance import pdist, squareform

distances = pdist(sample.values, metric='euclidean')
dist_matrix = squareform(distances)

pdist operates on NumPy arrays, and DataFrame.values is the underlying NumPy ndarray representation of the data frame. The metric argument lets you select one of several built-in distance metrics, or you can pass in any function of two rows to use a custom distance. It's very powerful and, in my experience, very fast. The result is a "flat" array that contains only the upper triangle of the distance matrix (because it's symmetric), not including the diagonal (because it's always 0). squareform then expands this flattened form into a full matrix. Note that built-in metrics such as 'euclidean' need numeric input, so for the symbolic sample in the question the rows would first have to be encoded as numbers and combined with a custom function.

The docs have more info, including a mathematical rundown of the many built-in distance functions.
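To apply pdist to the symbolic sample from the question, one option is to encode each symbol as its integer position in DistMatrix and pass a lookup function as the metric. This is a sketch rather than a definitive recipe: the names code_of and lookup_metric are invented here, and it assumes every symbol in sample appears in DistMatrix.index.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

sample = pd.DataFrame({'Sym1': ['a', 'a', 'a', 'd'], 'Sym2': ['a', 'c', 'b', 'b'],
                       'Sym3': ['a', 'c', 'b', 'd'], 'Sym4': ['b', 'b', 'b', 'a']},
                      index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame({'a': [0, 0, 0.67, 1.34], 'b': [0, 0, 0, 0.67],
                           'c': [0.67, 0, 0, 0], 'd': [1.34, 0.67, 0, 0]},
                          index=['a', 'b', 'c', 'd'])

# Encode each symbol as its row/column position in DistMatrix
# (assumption: every symbol in `sample` is present in DistMatrix.index).
code_of = {sym: i for i, sym in enumerate(DistMatrix.index)}
codes = sample.apply(lambda col: col.map(code_of)).to_numpy()
D = DistMatrix.to_numpy()

def lookup_metric(u, v):
    # u and v are two encoded rows; look up and sum the per-position distances
    return D[u.astype(int), v.astype(int)].sum()

result = pd.DataFrame(squareform(pdist(codes, metric=lookup_metric)),
                      index=sample.index, columns=sample.index)
print(result.round(2))
```

Because pdist only evaluates the upper triangle, this relies on DistMatrix being symmetric, which holds for the matrix in the question.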

For large data, I found a faster way to do this. Assume your data is already a NumPy array named a.

from sklearn.metrics.pairwise import euclidean_distances
dist = euclidean_distances(a, a)

Below is an experiment comparing the time needed by the two approaches:

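import time

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import euclidean_distances

a = np.random.rand(1000, 1000)

# approach 1: SciPy pdist + squareform
time1 = time.time()
distances = pdist(a, metric='euclidean')
dist_matrix = squareform(distances)
time2 = time.time()
time2 - time1  # 0.3639109134674072

# approach 2: scikit-learn euclidean_distances
time1 = time.time()
dist = euclidean_distances(a, a)
time2 = time.time()
time2 - time1  # 0.08735871315002441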
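Before trusting a timing comparison like this, it may be worth confirming that both approaches return the same matrix. The sketch below does that on a smaller random array; the helper timed, the seed, and the array shape are arbitrary choices made for illustration, and the printed timings will differ from machine to machine.

```python
import time

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
a = rng.random((200, 50))       # smaller than the array timed above

def timed(fn):
    """Hypothetical helper: run fn once, return (result, elapsed seconds)."""
    start = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - start

d_scipy, t_scipy = timed(lambda: squareform(pdist(a, metric='euclidean')))
d_sklearn, t_sklearn = timed(lambda: euclidean_distances(a, a))

print(f"scipy: {t_scipy:.4f}s, sklearn: {t_sklearn:.4f}s")
print("same result:", np.allclose(d_scipy, d_sklearn, atol=1e-6))
```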

This does twice as much work as needed, but it technically also works for non-symmetric distance matrices (whatever that is supposed to mean):

pd.DataFrame ( { idx1: { idx2:sum( DistMatrix[ x ][ y ]
                                  for (x, y) in zip( row1, row2 ) ) 
                         for (idx2, row2) in sample.iterrows( ) } 
                 for (idx1, row1 ) in sample.iterrows( ) } )

You can make it more readable by writing it in pieces:

# a helper function to compute distance of two items
dist = lambda xs, ys: sum( DistMatrix[ x ][ y ] for ( x, y ) in zip( xs, ys ) )

# a second helper function to compute distances from a given item
xdist = lambda x: { idx: dist( x, y ) for (idx, y) in sample.iterrows( ) }

# the pairwise distance matrix
pd.DataFrame( { idx: xdist( x ) for ( idx, x ) in sample.iterrows( ) } )
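For larger frames, the nested iterrows loops can become slow. One possible alternative, sketched here under the assumption that every symbol in sample appears in DistMatrix.index, is to encode the symbols as integer positions and let NumPy broadcasting perform all the lookups at once. The variable names codes, D and pairwise are illustrative only.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'Sym1': ['a', 'a', 'a', 'd'], 'Sym2': ['a', 'c', 'b', 'b'],
                       'Sym3': ['a', 'c', 'b', 'd'], 'Sym4': ['b', 'b', 'b', 'a']},
                      index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame({'a': [0, 0, 0.67, 1.34], 'b': [0, 0, 0, 0.67],
                           'c': [0.67, 0, 0, 0], 'd': [1.34, 0.67, 0, 0]},
                          index=['a', 'b', 'c', 'd'])

# Integer position of every symbol in DistMatrix.index, same shape as `sample`
# (get_indexer returns -1 for unknown symbols, which this sketch does not handle).
codes = DistMatrix.index.get_indexer(sample.to_numpy().ravel()).reshape(sample.shape)
D = DistMatrix.to_numpy()

# Broadcasting pairs every row with every row: the lookup has shape
# (n_items, n_items, n_positions); summing the last axis gives the distances.
pairwise = D[codes[:, None, :], codes[None, :, :]].sum(axis=2)

result = pd.DataFrame(pairwise, index=sample.index, columns=sample.index)
print(result.round(2))
```

Like the dictionary-comprehension version, this evaluates every ordered pair, so it also works when the distance matrix is not symmetric; the intermediate array grows with the square of the number of items.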
