
Compute distances between 2 dataframes based on boolean matrix as a mask

I have 2 dataframes where columns are features and rows are different items.

import pandas as pd 
import numpy as np
import random

random.seed(0) 
data1 = {'x':random.sample(range(1,100), 4), 'y':random.sample(range(1,100), 4), 
'size':random.sample(range(1,20), 4), 'weight':random.sample(range(1,20), 4), 
'volume':random.sample(range(1,50), 4)} 

data2 = {'x':random.sample(range(1,100), 6), 'y':random.sample(range(1,100), 6), 
'size':random.sample(range(1,10), 6), 'weight':random.sample(range(1,10), 6), 
'volume':random.sample(range(1,20), 6)}   

df1 = pd.DataFrame(data1) 
df2 = pd.DataFrame(data2) 

Here I need to create a mask. I will compute distances only between pairs of items for which df1['size'] > df2['size'], df1['weight'] > df2['weight'] and df1['volume'] > df2['volume'] all hold. Here this gives a (4,6) boolean array, with one row per item of df1 and one column per item of df2.
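A minimal sketch of how such a pairwise mask could be built with NumPy broadcasting. The values are copied from the example tables in the question, and the names cols, a and b are only illustrative.

```python
import numpy as np
import pandas as pd

# Example values copied from the df1 / df2 tables in the question
df1 = pd.DataFrame({'x': [50, 98, 54, 6], 'y': [34, 66, 63, 52],
                    'size': [10, 16, 12, 7], 'weight': [17, 5, 10, 18],
                    'volume': [49, 7, 40, 17]})
df2 = pd.DataFrame({'x': [69, 91, 78, 19, 40, 13], 'y': [94, 10, 88, 43, 61, 72],
                    'size': [2, 6, 4, 3, 5, 9], 'weight': [9, 8, 4, 5, 3, 1],
                    'volume': [18, 1, 3, 13, 12, 14]})

cols = ['size', 'weight', 'volume']
a = df1[cols].to_numpy()   # shape (4, 3)
b = df2[cols].to_numpy()   # shape (6, 3)

# a[:, None, :] has shape (4, 1, 3) and b[None, :, :] has shape (1, 6, 3).
# Broadcasting compares every df1 row with every df2 row, and all(axis=-1)
# requires the three features to be greater at the same time.
mask = (a[:, None, :] > b[None, :, :]).all(axis=-1)   # shape (4, 6)
print(mask.astype(int))
```

Rows of mask follow df1 and columns follow df2, so mask[i, j] says whether item i of df1 dominates item j of df2 on all three features.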

Then, I need to compute the Euclidean distance between items of df1 and items of df2 where the condition above is True. For the False cases, there is no need to compute the distance, and +Inf can be put in the array instead.

My intuition is to use numpy broadcasting and np.einsum for the distance because this should be the fastest. Runtime is the top priority.
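A sketch of that idea, assuming the distance is taken over the x and y columns only (an assumption inferred from the expected values, e.g. 62.94 for df1.0 and df2.0). The helper name masked_distances is hypothetical. It evaluates distances only at the True pairs, using np.nonzero to pick them and np.einsum for the row-wise squared norm, and leaves +Inf everywhere else. Whether this is really the fastest option would have to be timed on realistic data.

```python
import numpy as np

# Coordinates (x, y) and features (size, weight, volume) copied from the
# example tables in the question
xy1 = np.array([[50, 34], [98, 66], [54, 63], [6, 52]], dtype=float)
f1 = np.array([[10, 17, 49], [16, 5, 7], [12, 10, 40], [7, 18, 17]])
xy2 = np.array([[69, 94], [91, 10], [78, 88], [19, 43], [40, 61], [13, 72]],
               dtype=float)
f2 = np.array([[2, 9, 18], [6, 8, 1], [4, 4, 3], [3, 5, 13], [5, 3, 12],
               [9, 1, 14]])


def masked_distances(xy1, f1, xy2, f2):
    """Hypothetical helper: Euclidean distance on (x, y) where every
    feature of the first set is greater, +Inf elsewhere."""
    mask = (f1[:, None, :] > f2[None, :, :]).all(axis=-1)  # (n1, n2)
    i, j = np.nonzero(mask)            # index pairs where the mask is True
    diff = xy1[i] - xy2[j]             # (n_true, 2) coordinate differences
    out = np.full(mask.shape, np.inf)  # False pairs stay at +Inf
    # einsum('ij,ij->i') is the row-wise dot product, i.e. squared distance
    out[i, j] = np.sqrt(np.einsum('ij,ij->i', diff, diff))
    return out


dist = masked_distances(xy1, f1, xy2, f2)
print(np.round(dist, 2))
```

Indexing with np.nonzero keeps the work proportional to the number of True pairs, which should help most when the mask is sparse.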

Thanks for your time and help.

Example: df1 =

x   y   size    weight  volume
50  34  10      17      49
98  66  16       5       7
54  63  12      10      40
 6  52   7      18      17

df2 =

x   y   size    weight  volume
69  94  2       9       18
91  10  6       8        1
78  88  4       4        3
19  43  3       5       13
40  61  5       3       12
13  72  9       1       14
    

The first step (which does not have to be explicit) is to build the mask based on size, weight, and volume being greater in df1:

      df2.0   df2.1   df2.2   df2.3   df2.4   df2.5
df1.0     1       1       1       1       1       1
df1.1     0       0       1       0       0       0
df1.2     1       1       1       1       1       1
df1.3     0       1       1       1       1       0

The expected final result is then:

      df2.0   df2.1   df2.2   df2.3   df2.4   df2.5
df1.0 62.94   47.51   60.83   32.28   28.79   53.04
df1.1   Inf     Inf   29.73     Inf     Inf     Inf
df1.2 34.44   64.64   34.66   40.31   14.14   41.98
df1.3   Inf   94.81   80.50   15.81   35.17     Inf
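One way a labelled table of this shape could be produced is a dense variant: compute every pairwise distance, then overwrite the masked-out pairs with np.where. This is only a sketch. It assumes the distance uses the x and y columns, and the names feat, diff, dense and result are illustrative. For small inputs the dense form may be simpler and just as quick, but that would need measuring.

```python
import numpy as np
import pandas as pd

# Example values copied from the df1 / df2 tables in the question
df1 = pd.DataFrame({'x': [50, 98, 54, 6], 'y': [34, 66, 63, 52],
                    'size': [10, 16, 12, 7], 'weight': [17, 5, 10, 18],
                    'volume': [49, 7, 40, 17]})
df2 = pd.DataFrame({'x': [69, 91, 78, 19, 40, 13], 'y': [94, 10, 88, 43, 61, 72],
                    'size': [2, 6, 4, 3, 5, 9], 'weight': [9, 8, 4, 5, 3, 1],
                    'volume': [18, 1, 3, 13, 12, 14]})

feat = ['size', 'weight', 'volume']
mask = (df1[feat].to_numpy()[:, None, :]
        > df2[feat].to_numpy()[None, :, :]).all(axis=-1)      # (4, 6)

# All pairwise (x, y) differences, shape (4, 6, 2)
diff = (df1[['x', 'y']].to_numpy(dtype=float)[:, None, :]
        - df2[['x', 'y']].to_numpy(dtype=float)[None, :, :])
# Sum of squares over the last axis, then square root: (4, 6) distances
dense = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff))

# Keep the distance where the mask is True, +Inf elsewhere
result = pd.DataFrame(np.where(mask, dense, np.inf),
                      index=[f'df1.{i}' for i in df1.index],
                      columns=[f'df2.{j}' for j in df2.index]).round(2)
print(result)
```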

Is this what you are looking for?

# Pad df1 with NaN rows so that df1 and df2 have the same number of rows
# (DataFrame.append was removed in pandas 2.0, so reindex is used instead)
df1 = df1.reindex(range(len(df2)))
# Row-by-row condition: size, weight and volume must all be greater in df1
mask = (df1['size'] > df2['size']) & (df1['weight'] > df2['weight']) & (df1['volume'] > df2['volume'])
dft1 = df1[mask]
dft2 = df2[mask]
# Norm of the difference over all columns of the selected rows
print(np.linalg.norm(dft1 - dft2))

Output:

90.39358384310249
