Matching IDs Between Pandas DataFrames and Applying Function

Question

I have two data frames that look like the following:

df_A:

ID    x     y
a     0     0
c     3     2
b     2     5

df_B:

ID    x     y
a     2     1
c     3     5
b     1     2

I want to add a column to df_B that holds, for each identifier, the Euclidean distance between its x, y coordinates in df_B and its coordinates in df_A. The desired result would be:

ID    x     y    dist
a     2     1    2.236
c     3     5    3
b     1     2    3.162

The identifiers are not necessarily going to be in the same order. I know how to do this by looping through the rows of df_A and finding the matching ID in df_B, but I was hoping to avoid using a for loop since this will be used on data with tens of millions of rows. Is there some way to use apply but condition it on matching IDs?

Answer 1

If ID isn't the index, make it so.

df_B.set_index('ID', inplace=True)
df_A.set_index('ID', inplace=True)

df_B['dist'] = ((df_A - df_B) ** 2).sum(1) ** .5

Since pandas aligns arithmetic on index and column labels, simply doing the math should just work.
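To see that alignment at work, here is a minimal sketch using the question's data with df_B's rows deliberately shuffled. It assumes every ID appears exactly once in both frames; the construction of the frames is only for illustration.

```python
import pandas as pd

# Same data as the question, but with df_B's rows deliberately shuffled.
df_A = pd.DataFrame(
    {"ID": ["a", "c", "b"], "x": [0, 3, 2], "y": [0, 2, 5]}
).set_index("ID")
df_B = pd.DataFrame(
    {"ID": ["b", "a", "c"], "x": [1, 2, 3], "y": [2, 1, 5]}
).set_index("ID")

# Subtraction matches rows by ID label and columns by name, not by position.
diff = df_A - df_B

# The result is a Series indexed by ID; assigning it to df_B aligns on ID again,
# so each distance lands on the right row whatever the row order is.
df_B["dist"] = (diff ** 2).sum(axis=1) ** 0.5
print(df_B)
```

The row order of df_B is left unchanged; only the new dist column is added.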

Answer 2

A solution that uses the sklearn.metrics.pairwise.paired_distances method:

In [73]: A
Out[73]:
    x  y
ID
a   0  0
c   3  2
b   2  5

In [74]: B
Out[74]:
    x  y
ID
a   2  1
c   3  5
b   1  2

In [75]: from sklearn.metrics.pairwise import paired_distances

In [76]: B['dist'] = paired_distances(B, A)

In [77]: B
Out[77]:
    x  y      dist
ID
a   2  1  2.236068
c   3  5  3.000000
b   1  2  3.162278
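Note that paired_distances pairs rows by position rather than by ID, so the session above gives the right answer because A and B happen to list the IDs in the same order. When the order may differ, one option is to reindex A to follow B first. The sketch below assumes every ID in B also exists in A, and uses np.linalg.norm as a stand-in for paired_distances(B, A_aligned) only to keep the example free of the scikit-learn dependency.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(
    {"x": [0, 3, 2], "y": [0, 2, 5]},
    index=pd.Index(["a", "c", "b"], name="ID"),
)
# B lists the same IDs in a different order than A.
B = pd.DataFrame(
    {"x": [3, 2, 1], "y": [5, 1, 2]},
    index=pd.Index(["c", "a", "b"], name="ID"),
)

# Reorder A's rows and columns to follow B, so row i of each frame has the same ID.
A_aligned = A.reindex(index=B.index, columns=B.columns)

# Positional row-wise Euclidean distance on the aligned arrays.
B["dist"] = np.linalg.norm(B.to_numpy() - A_aligned.to_numpy(), axis=1)
print(B)
```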

Answer 3

For performance, you might want to work with NumPy arrays. For Euclidean distance computations between corresponding rows, np.einsum does the job pretty efficiently.

Incorporating the reordering of rows needed to align them, here's an implementation:

import numpy as np

# Get sorted row indices for dataframe-A
sidx = df_A.index.argsort()
idx = sidx[df_A.index.searchsorted(df_B.index,sorter=sidx)]

# Sort A rows accordingly and get the elementwise differences against B
s = df_A.values[idx] - df_B.values

# Use einsum to square and sum each row and finally sqrt for distances
df_B['dist'] = np.sqrt(np.einsum('ij,ij->i',s,s))

Sample input, output:

In [121]: df_A
Out[121]: 
   0  1
a  0  0
c  3  2
b  2  5

In [122]: df_B
Out[122]: 
   0  1
c  3  5
a  2  1
b  1  2

In [124]: df_B  # After code run
Out[124]: 
   0  1      dist
c  3  5  3.000000
a  2  1  2.236068
b  1  2  3.162278
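The two less obvious steps, the searchsorted lookup with a sorter and the einsum subscript, can be traced on plain arrays. This is a minimal sketch with the sample data above; the names ids_A, ids_B, xy_A and xy_B are illustrative stand-ins for the frames' index and values, and it assumes every ID in B also occurs in A.

```python
import numpy as np

ids_A = np.array(["a", "c", "b"])
ids_B = np.array(["c", "a", "b"])
xy_A = np.array([[0, 0], [3, 2], [2, 5]])
xy_B = np.array([[3, 5], [2, 1], [1, 2]])

# sidx lists the positions of ids_A in sorted order: ids_A[sidx] is a, b, c.
sidx = ids_A.argsort()

# searchsorted locates each B id within the sorted view of A;
# indexing sidx maps that location back to a position in the original ids_A.
idx = sidx[np.searchsorted(ids_A, ids_B, sorter=sidx)]

# Row i of xy_A[idx] now carries the same ID as row i of xy_B.
s = xy_A[idx] - xy_B

# 'ij,ij->i' multiplies s by itself elementwise and sums over j, i.e. per row.
dist = np.sqrt(np.einsum("ij,ij->i", s, s))
print(idx, dist)
```

If B contained an ID that is missing from A, searchsorted would still return an insertion position, so the lookup would pair it with the wrong row or index out of bounds; checking ids_A[idx] against ids_B is a cheap guard.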

Here's a runtime test comparing einsum against a few other counterparts.
