
Calculate Euclidean distance between groups in a data frame

I have weekly data for various stores in the following form:

pd.DataFrame({'Store': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S3', 'S3', 'S3'],
              'Week': [1, 2, 3, 1, 2, 3, 1, 2, 3],
              'Sales': [20, 30, 40, 21, 31, 41, 22, 32, 42],
              'Cust_count': [2, 4, 6, 3, 5, 7, 4, 6, 8]})

  Store  Week  Sales  Cust_count
0    S1     1     20           2
1    S1     2     30           4
2    S1     3     40           6
3    S2     1     21           3
4    S2     2     31           5
5    S2     3     41           7
6    S3     1     22           4
7    S3     2     32           6
8    S3     3     42           8

As you can see, the data is at a store-week level. I want to calculate the Euclidean distance between each pair of stores for the same week and then take the average of those distances. For example, the calculation for stores S1 and S2 would look as follows:

    For week 1: sqrt((20-21)^2 + (2-3)^2) = sqrt(2)
    For week 2: sqrt((30-31)^2 + (4-5)^2) = sqrt(2)
    For week 3: sqrt((40-41)^2 + (6-7)^2) = sqrt(2)
    The final value for the distance between S1 and S2 = sqrt(2), calculated as the
    average of the three weekly distances, i.e. (3 * sqrt(2)) / 3
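A quick sanity check of this arithmetic in plain Python (the `weeks` list just hard-codes the (Sales, Cust_count) pairs for S1 and S2 from the table above):

```python
import math

# (Sales, Cust_count) pairs for S1 and S2 in weeks 1-3, taken from the table
weeks = [((20, 2), (21, 3)), ((30, 4), (31, 5)), ((40, 6), (41, 7))]

# Per-week Euclidean distance, then the average over the weeks
dists = [math.hypot(a[0] - b[0], a[1] - b[1]) for a, b in weeks]
avg = sum(dists) / len(dists)  # every week gives sqrt(2), so the average is sqrt(2)
```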

Finally the output should be as follows:

   S1     S2     S3
S1 0      1.414  2.828
S2 1.414  0      1.414
S3 2.828  1.414  0

I have some idea about the groupby function for grouping columns in a data frame, and about scipy.spatial.distance.cdist for calculating Euclidean distances, but I am unable to tie these concepts together into a solution.

We can pivot, then use NumPy broadcasting to do these calculations:

import numpy as np

df1 = (df.pivot(index='Store', columns='Week', values=['Sales', 'Cust_count'])
       #  .fillna(0)  # uncomment to treat missing store-weeks as 0s
       )
arr1 = df1['Sales'].to_numpy()       # shape (n_stores, n_weeks)
arr2 = df1['Cust_count'].to_numpy()

# Pairwise differences via broadcasting, per-week distances, then the mean over weeks
data = np.nanmean(np.sqrt((arr1[None, :] - arr1[:, None])**2
                          + (arr2[None, :] - arr2[:, None])**2),
                  axis=2)

pd.DataFrame(data, index=df1.index, columns=df1.index)

Store        S1        S2        S3
Store                              
S1     0.000000  1.414214  2.828427
S2     1.414214  0.000000  1.414214
S3     2.828427  1.414214  0.000000
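The question also mentions scipy.spatial.distance.cdist; a sketch of that route is to build one full store-by-store distance matrix per week with cdist and then average the weekly matrices. This assumes every store has a row for every week, so each weekly group has the same stores in the same (sorted) order:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame({'Store': ['S1','S1','S1','S2','S2','S2','S3','S3','S3'],
                   'Week': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'Sales': [20, 30, 40, 21, 31, 41, 22, 32, 42],
                   'Cust_count': [2, 4, 6, 3, 5, 7, 4, 6, 8]})

stores = sorted(df['Store'].unique())
weekly = []
for _, g in df.sort_values('Store').groupby('Week'):
    pts = g[['Sales', 'Cust_count']].to_numpy()
    weekly.append(cdist(pts, pts))  # pairwise store distances for this week

# Average the per-week distance matrices
res = pd.DataFrame(np.mean(weekly, axis=0), index=stores, columns=stores)
```

Note that running cdist once on the concatenated weekly columns would instead give sqrt of the *sum* of squared weekly differences, which is a different quantity from the average of the weekly distances that the question asks for.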

For loop with permutations

import itertools

import numpy as np

s = list(itertools.permutations(df.Store.unique(), 2))
l = []
for x in s:
    a = df[df.Store == x[0]].iloc[:, 2:].values  # Sales, Cust_count for the first store
    b = df[df.Store == x[1]].iloc[:, 2:].values  # rows align by Week
    # mean of the weekly Euclidean distances, as the question asks
    l.append(np.mean(np.sqrt(np.sum((a - b)**2, axis=1))))

s=pd.Series(l,index=pd.MultiIndex.from_tuples(s)).unstack()
s
Out[216]: 
          S1        S2        S3
S1       NaN  1.414214  2.828427
S2  1.414214       NaN  1.414214
S3  2.828427  1.414214       NaN
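Since the distance is symmetric, permutations computes each pair twice. A sketch of the same loop with itertools.combinations, filling both halves of the matrix from one computation (again assuming rows within each store are aligned by Week):

```python
import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame({'Store': ['S1','S1','S1','S2','S2','S2','S3','S3','S3'],
                   'Week': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'Sales': [20, 30, 40, 21, 31, 41, 22, 32, 42],
                   'Cust_count': [2, 4, 6, 3, 5, 7, 4, 6, 8]})

stores = df['Store'].unique()
out = pd.DataFrame(0.0, index=stores, columns=stores)  # diagonal stays 0
for a, b in itertools.combinations(stores, 2):
    pa = df.loc[df.Store == a, ['Sales', 'Cust_count']].to_numpy()
    pb = df.loc[df.Store == b, ['Sales', 'Cust_count']].to_numpy()
    d = np.mean(np.sqrt(((pa - pb) ** 2).sum(axis=1)))  # mean weekly distance
    out.loc[a, b] = out.loc[b, a] = d                   # fill both halves
```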

You can first merge on Week to get all combinations of stores, then compute a dist column with the Euclidean distance, and finally pivot_table with aggfunc='mean':

df.merge(df, on='Week', how='left', suffixes=('','_'))\
  .assign(dist = lambda x: np.sqrt((x.Sales - x.Sales_)**2 + (x.Cust_count - x.Cust_count_)**2))\
  .pivot_table(index='Store', columns='Store_', values='dist', aggfunc='mean')

Store_        S1        S2        S3
Store                               
S1      0.000000  1.414214  2.828427
S2      1.414214  0.000000  1.414214
S3      2.828427  1.414214  0.000000
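A quick way to sanity-check the merge step: the self-join on Week should produce n_stores² rows per week (each store paired with every store, including itself), and the diagonal of the pivoted result should then be exactly 0. This sketch just re-runs the pipeline above with an intermediate assert:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Store': ['S1','S1','S1','S2','S2','S2','S3','S3','S3'],
                   'Week': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'Sales': [20, 30, 40, 21, 31, 41, 22, 32, 42],
                   'Cust_count': [2, 4, 6, 3, 5, 7, 4, 6, 8]})

merged = df.merge(df, on='Week', how='left', suffixes=('', '_'))
# Every week pairs each store with every store, including itself
assert len(merged) == df['Week'].nunique() * df['Store'].nunique() ** 2

merged['dist'] = np.sqrt((merged.Sales - merged.Sales_) ** 2
                         + (merged.Cust_count - merged.Cust_count_) ** 2)
res = merged.pivot_table(index='Store', columns='Store_', values='dist', aggfunc='mean')
```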
