How to calculate Euclidean distance between combinations of rows in a pandas DataFrame
I have the following dataframe:
import pandas as pd
foo = pd.DataFrame({'cluster': [1,2,3],
'var1': [0.3,0.5,1],
'var2': [0.6,0.2,0.7],
'var3': [0.4,0.4,0.3]})
Each row corresponds to a cluster, and the values of the var columns give the cluster centre with respect to each var.
I would like to calculate the Euclidean distance from each cluster to every other cluster.
I tried this:
from itertools import combinations
def distance(list1, list2):
    """Distance between two vectors."""
    squares = [(p - q) ** 2 for p, q in zip(list1, list2)]
    return sum(squares) ** .5
foo_m = foo.melt(id_vars='cluster')
for k, v in list(combinations(foo_m.cluster.unique(), 2)):
    print(k, v)
    print(distance(list(foo_m.query('cluster == @k')['value']),
                   list(foo_m.query('cluster == @v')['value'])))
However, I want the result as a dataframe laid out like a correlation matrix, where both the rows and the columns are the clusters and each value is the distance between the corresponding pair of clusters. Any ideas?
The expected output is a symmetric matrix that looks like this:
pd.DataFrame({'cluster': [1,2,3], 'cluster_1':[0,0.447213, 0.71414],
'cluster_2': [0.447213, 0, 0.714142], 'cluster_3':[0.71414, 0.714142, 0]})
Try with scipy:
from scipy.spatial.distance import pdist, squareform
output = pd.DataFrame(squareform(pdist(foo.set_index("cluster"))),
index=foo["cluster"].values,
columns=foo["cluster"].values)
>>> output
          1         2         3
1  0.000000  0.447214  0.714143
2  0.447214  0.000000  0.714143
3  0.714143  0.714143  0.000000
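If scipy is not available, the same matrix can probably be built with plain NumPy broadcasting. The following is only a sketch under that assumption, not part of the original answer; the names centres, diffs, dist and output_np are made up for illustration.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({'cluster': [1, 2, 3],
                    'var1': [0.3, 0.5, 1],
                    'var2': [0.6, 0.2, 0.7],
                    'var3': [0.4, 0.4, 0.3]})

# Cluster centres as an (n_clusters, n_vars) array (hypothetical name)
centres = foo.set_index('cluster').to_numpy()

# Broadcasting (n, 1, d) - (1, n, d) gives all pairwise differences, shape (n, n, d)
diffs = centres[:, None, :] - centres[None, :, :]

# Square, sum over the variable axis, and take the root: Euclidean distance per pair
dist = np.sqrt((diffs ** 2).sum(axis=-1))

# Label rows and columns with the cluster ids, as in the scipy version
output_np = pd.DataFrame(dist,
                         index=foo['cluster'].values,
                         columns=foo['cluster'].values)
print(output_np.round(6))
```

This holds an (n, n, d) intermediate array in memory, so it is fine for a handful of cluster centres, but pdist is likely the better choice for large inputs.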