Calculating and using Euclidean Distance in Python

Question

I am trying to calculate the Euclidean distance between two datasets in Python. I can do this using the following:

np.linalg.norm(df-signal)

Here df and signal are my two datasets. This returns a single numerical value (e.g., 8258155.579535276), which is fine. My issue is that I want it to return the distance for each column in the dataset. Something like this:

AFNLWGT     4.867376e+10
AGI         3.769233e+09
EMCONTRB    1.202935e+07
FEDTAX      8.095078e+07
PTOTVAL     2.500056e+09
STATETAX    1.007451e+07
TAXINC      2.027124e+09
POTHVAL     1.158428e+08
INTVAL      1.606913e+07
PEARNVAL    2.038357e+09
FICA        1.080950e+07
WSALVAL     1.986075e+09
ERNVAL      1.905109e+09

I'm fairly new to Python so would really appreciate any help possible.

Answer 1

To get the column-wise norm with column headers, you can use pandas.DataFrame.aggregate together with np.linalg.norm:

import pandas as pd
import numpy as np

norms = (df-signal).aggregate(np.linalg.norm)

Note that, by default, .aggregate operates along axis 0, so the function is applied to each column separately.
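As a small sanity check of the column-wise behaviour, here is a sketch on toy data. The frames and the column names "a" and "b" are made up for illustration and are not from the question; the hand-written formula is only there to show what the norm computes for each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names "a" and "b" are illustrative only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.0, 0.0, 0.0]})
signal = pd.DataFrame({"a": [1.0, 0.0, 0.0], "b": [3.0, 4.0, 0.0]})

diff = df - signal

# One Euclidean norm per column, labelled with the column names.
norms = diff.aggregate(np.linalg.norm)

# The same quantity written out by hand: sqrt of the sum of squares per column.
manual = np.sqrt((diff ** 2).sum(axis=0))

# The single overall value from the question (norm over all entries).
total = np.linalg.norm(diff)

print(norms)
print(total)
```

The overall value and the per-column values are related: the square of the single number equals the sum of the squared column norms.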

However, the .aggregate approach will be much slower than a pure NumPy implementation:

norms = pd.Series(np.linalg.norm(df.to_numpy()-signal.to_numpy(), axis=0), 
                  index=df.columns)

With test data of size 100x2, the latter was roughly 20x faster.
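If you want to check the speed difference on your own machine, something like the following sketch could be used. The random 100x2 frames, the column names and the helper function names are assumptions made for illustration, and the measured ratio will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical 100x2 test data; the column names are illustrative only.
rng = np.random.default_rng(0)
cols = ["x", "y"]
df = pd.DataFrame(rng.normal(size=(100, 2)), columns=cols)
signal = pd.DataFrame(rng.normal(size=(100, 2)), columns=cols)


def norms_aggregate():
    # pandas version: applies np.linalg.norm to each column of the difference.
    return (df - signal).aggregate(np.linalg.norm)


def norms_numpy():
    # NumPy version: one vectorised call, then re-attach the column labels.
    values = np.linalg.norm(df.to_numpy() - signal.to_numpy(), axis=0)
    return pd.Series(values, index=df.columns)


# Both approaches should give the same numbers.
same = np.allclose(norms_aggregate().to_numpy(), norms_numpy().to_numpy())

t_agg = timeit.timeit(norms_aggregate, number=200)
t_np = timeit.timeit(norms_numpy, number=200)
print(f"same result: {same}")
print(f"aggregate: {t_agg:.4f}s, numpy: {t_np:.4f}s")
```

One caveat: df - signal aligns the two frames on their row and column labels, whereas the to_numpy() version subtracts purely by position, so the two only agree when both frames share the same index and column order.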
