Calculating and using Euclidean Distance in Python
I am trying to calculate the Euclidean distance between two datasets in Python. I can do this using the following:
np.linalg.norm(df-signal)
Here df and signal are my two datasets. This returns a single numerical value (e.g., 8258155.579535276), which is fine. My issue is that I want it to return the difference between each column in the dataset. Something like this:
AFNLWGT 4.867376e+10
AGI 3.769233e+09
EMCONTRB 1.202935e+07
FEDTAX 8.095078e+07
PTOTVAL 2.500056e+09
STATETAX 1.007451e+07
TAXINC 2.027124e+09
POTHVAL 1.158428e+08
INTVAL 1.606913e+07
PEARNVAL 2.038357e+09
FICA 1.080950e+07
WSALVAL 1.986075e+09
ERNVAL 1.905109e+09
I'm fairly new to Python so would really appreciate any help possible.
To get the column-wise norm with column headers, you can use pandas.DataFrame.aggregate together with np.linalg.norm:
import pandas as pd
import numpy as np
norms = (df-signal).aggregate(np.linalg.norm)
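As a small illustration, here is a sketch on two hypothetical toy frames (the values are made up; only the column names are borrowed from the question). It assumes both frames share the same columns and row index, since df - signal aligns on labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: same column names and same row index in both frames.
df = pd.DataFrame({"AGI": [3.0, 0.0], "FEDTAX": [1.0, 2.0]})
signal = pd.DataFrame({"AGI": [0.0, 4.0], "FEDTAX": [1.0, 2.0]})

# One Euclidean norm per column of the element-wise difference.
norms = (df - signal).aggregate(np.linalg.norm)
print(norms)

# The single value from the question is the norm over all entries at once,
# i.e. the square root of the sum of the squared per-column norms.
total = np.linalg.norm(df - signal)
print(total, np.sqrt((norms ** 2).sum()))
```

The result is a Series indexed by column name, which matches the shape of the output asked for in the question.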
Notice that, by default, .aggregate operates along the 0-axis (hence columns).
However, this will be much slower than the NumPy implementation:
norms = pd.Series(np.linalg.norm(df.to_numpy()-signal.to_numpy(), axis=0),
index=df.columns)
With test data of size 100x2, the latter is 20x faster.
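A rough way to check this on your own machine is sketched below. The random 100x2 frames, the column names "a" and "b", and the helper names with_aggregate and with_numpy are assumptions for illustration; the measured ratio will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical random test data of shape 100x2 (column names are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 2)), columns=["a", "b"])
signal = pd.DataFrame(rng.normal(size=(100, 2)), columns=["a", "b"])

def with_aggregate():
    # pandas applies np.linalg.norm to each column of the difference.
    return (df - signal).aggregate(np.linalg.norm)

def with_numpy():
    # NumPy computes all column norms in one call; index restores the headers.
    return pd.Series(
        np.linalg.norm(df.to_numpy() - signal.to_numpy(), axis=0),
        index=df.columns,
    )

# Both approaches should give the same per-column norms.
assert np.allclose(with_aggregate().to_numpy(), with_numpy().to_numpy())

t_agg = timeit.timeit(with_aggregate, number=200)
t_np = timeit.timeit(with_numpy, number=200)
print(f"aggregate: {t_agg:.4f}s, numpy: {t_np:.4f}s, ratio: {t_agg / t_np:.1f}x")
```

One difference to keep in mind: df - signal aligns rows and columns by label, while the to_numpy() version subtracts by position, so it only gives the same answer when both frames have the same shape and the same row and column order.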