Compute distances between 2 dataframes based on a boolean matrix as a mask
I have 2 dataframes where columns are features and rows are different items.
import pandas as pd
import numpy as np
import random
random.seed(0)
data1 = {'x': random.sample(range(1, 100), 4), 'y': random.sample(range(1, 100), 4),
         'size': random.sample(range(1, 20), 4), 'weight': random.sample(range(1, 20), 4),
         'volume': random.sample(range(1, 50), 4)}
data2 = {'x': random.sample(range(1, 100), 6), 'y': random.sample(range(1, 100), 6),
         'size': random.sample(range(1, 10), 6), 'weight': random.sample(range(1, 10), 6),
         'volume': random.sample(range(1, 20), 6)}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
Here I need to create a mask. I will compute distances only between pairs of items for which df1['size'] > df2['size'] & df1['weight'] > df2['weight'] & df1['volume'] > df2['volume'], where every item of df1 is compared with every item of df2. This would give a (4, 6) boolean array here (df1 items as rows, df2 items as columns).
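For illustration, here is a minimal sketch of how such a pairwise mask could be built with NumPy broadcasting. The array names `a` and `b` are hypothetical, and the values are copied by hand from the example tables in this question; with the dataframes above they could presumably be obtained with something like `df1[['size', 'weight', 'volume']].to_numpy()`.

```python
import numpy as np

# size, weight, volume of each item, copied from the example tables
a = np.array([[10, 17, 49],
              [16,  5,  7],
              [12, 10, 40],
              [ 7, 18, 17]])   # df1 items, shape (4, 3)
b = np.array([[2, 9, 18],
              [6, 8,  1],
              [4, 4,  3],
              [3, 5, 13],
              [5, 3, 12],
              [9, 1, 14]])     # df2 items, shape (6, 3)

# (4, 1, 3) > (1, 6, 3) broadcasts to (4, 6, 3); all(axis=-1) requires
# size, weight and volume to be strictly greater at the same time.
mask = (a[:, None, :] > b[None, :, :]).all(axis=-1)

print(mask.astype(int))
```

The result has df1 items as rows and df2 items as columns; `mask.T` gives the other orientation if that is what is needed.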
Then, I need to compute the Euclidean distance between items of df1 and items of df2 for the pairs where the condition above is True. For the False cases, there is no need to compute the distance, and +Inf can be put in the array instead.
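As a rough baseline, the simplest version computes every pairwise distance and then overwrites the masked-out entries with +Inf using `np.where`. This sketch assumes the distance is taken over the `x` and `y` columns only (that is what the example values in this question suggest); the names `p1`, `p2` and `mask` are hypothetical, and the mask is hard-coded from the example.

```python
import numpy as np

# x, y coordinates copied from the example tables
p1 = np.array([[50, 34], [98, 66], [54, 63], [6, 52]], dtype=float)    # df1
p2 = np.array([[69, 94], [91, 10], [78, 88],
               [19, 43], [40, 61], [13, 72]], dtype=float)             # df2

# Mask from the example (rows: df1 items, columns: df2 items)
mask = np.array([[1, 1, 1, 1, 1, 1],
                 [0, 0, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 1],
                 [0, 1, 1, 1, 1, 0]], dtype=bool)

diff = p1[:, None, :] - p2[None, :, :]       # shape (4, 6, 2)
dist = np.sqrt((diff ** 2).sum(axis=-1))     # all 24 pairwise distances
result = np.where(mask, dist, np.inf)        # +Inf where the condition is False

print(np.round(result, 2))
```

This does not skip any work for the False pairs, so it is only a reference point to time other approaches against.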
My intuition is to use NumPy broadcasting and np.einsum for the distance, because this should be the fastest. Runtime is the top priority.
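One possible way to combine broadcasting and `np.einsum` while evaluating only the True pairs is sketched below. The helper name `masked_distances` and the arrays `items1` / `items2` are hypothetical, the data is copied from the example tables, and the distance is again assumed to use `x` and `y` only.

```python
import numpy as np

def masked_distances(p1, p2, mask):
    """Euclidean distances for the pairs where mask is True, +inf elsewhere."""
    out = np.full(mask.shape, np.inf)
    i, j = np.nonzero(mask)            # row/column indices of the True pairs
    diff = p1[i] - p2[j]               # shape (n_true_pairs, n_dims)
    # Row-wise dot product diff . diff, i.e. the squared distance of each pair
    out[i, j] = np.sqrt(np.einsum('ij,ij->i', diff, diff))
    return out

# Columns: x, y, size, weight, volume (values from the example tables)
items1 = np.array([[50, 34, 10, 17, 49],
                   [98, 66, 16,  5,  7],
                   [54, 63, 12, 10, 40],
                   [ 6, 52,  7, 18, 17]], dtype=float)
items2 = np.array([[69, 94, 2, 9, 18],
                   [91, 10, 6, 8,  1],
                   [78, 88, 4, 4,  3],
                   [19, 43, 3, 5, 13],
                   [40, 61, 5, 3, 12],
                   [13, 72, 9, 1, 14]], dtype=float)

# Pairwise condition on size, weight and volume (columns 2 to 4)
mask = (items1[:, None, 2:] > items2[None, :, 2:]).all(axis=-1)
result = masked_distances(items1[:, :2], items2[:, :2], mask)

print(np.round(result, 2))
```

Whether this beats the dense computation probably depends on how sparse the mask is and on the array sizes, since the fancy indexing has its own cost; it is worth timing both on realistic data.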
Thanks for your time and help.
Example: df1 =
 x   y  size  weight  volume
50  34    10      17      49
98  66    16       5       7
54  63    12      10      40
 6  52     7      18      17
df2 =
 x   y  size  weight  volume
69  94     2       9      18
91  10     6       8       1
78  88     4       4       3
19  43     3       5      13
40  61     5       3      12
13  72     9       1      14
The first step (which does not have to be explicit) is to build the mask, based on size, weight, and volume all being greater in df1:
        df2.0  df2.1  df2.2  df2.3  df2.4  df2.5
df1.0       1      1      1      1      1      1
df1.1       0      0      1      0      0      0
df1.2       1      1      1      1      1      1
df1.3       0      1      1      1      1      0
The final result expected is then:
        df2.0  df2.1  df2.2  df2.3  df2.4  df2.5
df1.0   62.94  47.51  60.83  32.28  28.79  53.04
df1.1     Inf    Inf  29.73    Inf    Inf    Inf
df1.2   34.44  64.64  34.66  40.31  14.14  41.98
df1.3     Inf  94.81  80.50  15.81  35.17    Inf
Is this what you are looking for?
# Pad df1 with empty (NaN) rows so that df1 and df2 have the same number of rows.
# DataFrame.append was removed in pandas 2.0, so reindex is used here instead.
df1 = df1.reindex(range(len(df2)))
# Row-wise condition: row i of df1 is compared with row i of df2 only.
mask = (df1['size'] > df2['size']) & (df1['weight'] > df2['weight']) & (df1['volume'] > df2['volume'])
dft1 = df1[mask]
dft2 = df2[mask]
# Single norm of the difference over all columns of the selected rows
print(np.linalg.norm(dft1 - dft2))
Output:
90.39358384310249