Fastest way to compare rows of two pandas DataFrames?

Question

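So I have two pandas DataFrames, A and B.

A is 1000 rows × 500 columns, filled with binary values indicating presence or absence.

B is 1024 rows × 10 columns and is the complete enumeration of 0s and 1s over 10 positions, hence the 1024 rows.

I'm trying to find which rows of A, across a specific 10 columns of A, correspond to a given row of B. I need the whole row to match, not element by element.

For example, I would want

A[(A.loc[:, [1,2,3,4,5,6,7,8,9,10]] == (1,0,1,0,1,0,0,1,0,0)).all(axis=1)]

to return the rows of A, say (3,5,8,11,15), whose values in columns (1,2,3,4,5,6,7,8,9,10) match the B row (1,0,1,0,1,0,0,1,0,0).

I want to do this for every row of B. The best way I could come up with is: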

import numpy as np
for i in B.values:  # iterate over the rows of B; iterating B directly yields column labels
    B_array = np.array(i)
    Matching_Rows = A[(A.loc[:, [1,2,3,4,5,6,7,8,9,10]] == B_array).all(axis=1)]
    Matching_Rows_Index = Matching_Rows.index

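For a single pass this isn't terrible, but I use it inside a while loop that runs about 20,000 times, so it slows things down considerably.

I've been messing around with DataFrame.apply to no avail. Would map work better?

I'm just hoping someone sees something obviously more efficient, as I'm fairly new to Python.

Thanks and best regards!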

Answer 1

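Since both DataFrames hold only the binary values 0 and 1, we can exploit that fact by treating each row as a sequence of binary digits that converts to a decimal number, collapsing the relevant columns of A and all columns of B into 1D arrays. This shrinks the problem considerably, which should help performance. Once we have those 1D arrays, we can use np.in1d to look for matches from B in A, and finally np.where on the result to get the matching indices.

So we would have an implementation like this: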

import numpy as np

# Setup 1D arrays corresponding to selected cols from A and entire B
S = 2**np.arange(10)
A_ID = np.dot(A[list(range(1,11))],S)
B_ID = np.dot(B,S)

# Look for matches that exist from B_ID in A_ID, whose indices
# would be desired row indices that have matched from B
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]

Sample run:

In [157]: # Setup dataframes A and B with rows 0, 4 in A having matches from B
     ...: A_arr = np.random.randint(0,2,(10,14))
     ...: B_arr = np.random.randint(0,2,(7,10))
     ...: 
     ...: B_arr[2] = A_arr[4,1:11]
     ...: B_arr[4] = A_arr[4,1:11]
     ...: B_arr[5] = A_arr[0,1:11]
     ...: 
     ...: A = pd.DataFrame(A_arr)
     ...: B = pd.DataFrame(B_arr)
     ...: 

In [158]: S = 2**np.arange(10)
     ...: A_ID = np.dot(A[range(1,11)],S)
     ...: B_ID = np.dot(B,S)
     ...: out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
     ...: 

In [159]: out_row_idx
Out[159]: array([0, 4])
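
The snippet above reports which rows of A match some row of B. If you also need to know which B row each match belongs to, as the question asks, the same integer codes can be grouped in a dictionary. The following is only a sketch under assumptions: the random data, the planted matches, and the names rows_by_code and matches are made up for illustration and are not part of the original answer.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Small made-up frames; columns 1..10 of A are the ones compared against B
rng = np.random.default_rng(0)
A = pd.DataFrame(rng.integers(0, 2, size=(10, 14)))
B = pd.DataFrame(rng.integers(0, 2, size=(7, 10)))

# Plant two known matches so the result is not empty
B.iloc[2] = A.iloc[4, 1:11].to_numpy()
B.iloc[5] = A.iloc[0, 1:11].to_numpy()

# Collapse each 10-bit row into a single integer code
weights = 2 ** np.arange(10)
A_ID = A.loc[:, 1:10].to_numpy() @ weights  # .loc label slice, inclusive of 10
B_ID = B.to_numpy() @ weights

# Group A's row labels by code, then look up the code of every B row
rows_by_code = defaultdict(list)
for a_label, code in zip(A.index, A_ID):
    rows_by_code[int(code)].append(a_label)

matches = {b_label: rows_by_code.get(int(code), [])
           for b_label, code in zip(B.index, B_ID)}
```

Here matches maps every row label of B to the list of A row labels whose selected columns equal that row, with an empty list when nothing matches. Building the dictionary takes one pass over A, so the lookup for each B row no longer scans the whole frame.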

Answer 2

You can use merge together with reset_index. The output pairs each index of B with the indices of A whose selected columns are equal to that row of B:

A = pd.DataFrame({'A':[1,0,1,1],
                  'B':[0,0,1,1],
                  'C':[1,0,1,1],
                  'D':[1,1,1,0],
                  'E':[1,1,0,1]})

print (A)
   A  B  C  D  E
0  1  0  1  1  1
1  0  0  0  1  1
2  1  1  1  1  0
3  1  1  1  0  1

B = pd.DataFrame({'0':[1,0,1],
                  '1':[1,0,1],
                  '2':[1,0,0]})

print (B)
   0  1  2
0  1  1  1
1  0  0  0
2  1  1  0

print (pd.merge(B.reset_index(), 
                A.reset_index(), 
                left_on=B.columns.tolist(), 
                right_on=A.columns[[0,1,2]].tolist(),
                suffixes=('_B','_A')))

   index_B  0  1  2  index_A  A  B  C  D  E
0        0  1  1  1        2  1  1  1  1  0
1        0  1  1  1        3  1  1  1  0  1
2        1  0  0  0        1  0  0  0  1  1    

print (pd.merge(B.reset_index(), 
                A.reset_index(), 
                left_on=B.columns.tolist(), 
                right_on=A.columns[[0,1,2]].tolist(),
                suffixes=('_B','_A'))[['index_B','index_A']])    

   index_B  index_A
0        0        2
1        0        3
2        1        1
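
If a mapping from each B row to its matching A rows is more convenient than the two-column frame, the merge result can be grouped afterwards. This is a sketch that reuses the example frames above; the names pairs and matches are illustrative, not from the original answer.

```python
import pandas as pd

A = pd.DataFrame({'A': [1, 0, 1, 1],
                  'B': [0, 0, 1, 1],
                  'C': [1, 0, 1, 1],
                  'D': [1, 1, 1, 0],
                  'E': [1, 1, 0, 1]})

B = pd.DataFrame({'0': [1, 0, 1],
                  '1': [1, 0, 1],
                  '2': [1, 0, 0]})

# Same inner merge as above, keeping only the two index columns
pairs = pd.merge(B.reset_index(),
                 A.reset_index(),
                 left_on=B.columns.tolist(),
                 right_on=A.columns[[0, 1, 2]].tolist(),
                 suffixes=('_B', '_A'))[['index_B', 'index_A']]

# Collect the A indices belonging to each B index
matches = pairs.groupby('index_B')['index_A'].apply(list).to_dict()
```

Because the merge is an inner join, B rows without any match in A (row 2 in this example) simply do not appear as keys.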

Answer 3

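You can do this in pandas with loc (or ix in older pandas versions; ix has since been removed), telling it to look for rows where all ten columns are equal. Something like this, where B is assumed to stand for the single pattern being matched, indexed by the same labels 1 to 10, rather than the full 1024-row frame:

A.loc[(A[1]==B[1]) & (A[2]==B[2]) & (A[3]==B[3]) & (A[4]==B[4]) & (A[5]==B[5]) & (A[6]==B[6]) & (A[7]==B[7]) & (A[8]==B[8]) & (A[9]==B[9]) & (A[10]==B[10])]

It's rather ugly in my opinion, but it works and avoids an explicit Python loop over rows, so it should be noticeably faster. I wouldn't be surprised if someone can write the same operation more elegantly.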
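
One more compact way to express the same ten comparisons is to compare the selected columns against the pattern in a single broadcasted operation. The sketch below uses made-up data; the names cols, pattern, mask and matching_index are hypothetical, and the pattern is planted in three rows only so that there is something to find.

```python
import numpy as np
import pandas as pd

cols = list(range(1, 11))                         # the ten columns of interest
pattern = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])

# Made-up binary data with the pattern planted in rows 3, 5 and 8
rng = np.random.default_rng(1)
arr = rng.integers(0, 2, size=(1000, 500))
arr[[3, 5, 8], 1:11] = pattern
A = pd.DataFrame(arr)

# Compare all ten columns at once; a row matches only if every column agrees
mask = (A[cols].to_numpy() == pattern).all(axis=1)
matching_index = A.index[mask]
matching_rows = A.loc[mask]
```

This performs the same element-wise comparisons as the chained expression, just without spelling out each column.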

Answer 4

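In this particular case, your rows of ten zeros and ones can be interpreted as 10-bit binary numbers. If B is in order, it can be interpreted as the range 0 to 1023. In that case, all we need to do is take each row of A in chunks of 10 columns and compute the integer that each chunk encodes.

I'll start by defining a range of powers of two so that I can do a matrix multiplication with it.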

twos = pd.Series(np.power(2, np.arange(10)))

Next, I'll relabel A's columns as a MultiIndex and stack to get the chunks of 10.

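A = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
# Level 0: which chunk of 10 columns; level 1: position within the chunk
# (integer division // is needed so the chunk labels stay integers)
A.columns = pd.MultiIndex.from_arrays([A.columns // 10, A.columns % 10])
A_ = A.stack(0)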

A_.head()

Finally, I'll multiply A_ by twos to get the integer representation of each row, then unstack.

A_.dot(twos).unstack()

This is now a 1000 x 50 DataFrame in which each cell tells us which row of B the corresponding 10-column chunk of that row of A matches. B isn't even needed.
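
The same chunk-wise encoding can also be sketched directly in NumPy by reshaping instead of stacking. This is an alternative formulation, not the original answer's code, and the names blocks and codes are illustrative. It assumes, like twos above, that the first column of each chunk is the least significant bit; whether a code equals the row number in B depends on B being ordered the same way.

```python
import numpy as np
import pandas as pd

# Made-up binary data with the same shape as in the question
rng = np.random.default_rng(2)
A = pd.DataFrame(rng.integers(0, 2, size=(1000, 500)))

twos = 2 ** np.arange(10)  # weights 1, 2, 4, ..., 512

# View every row as 50 consecutive chunks of 10 columns: shape (1000, 50, 10)
blocks = A.to_numpy().reshape(len(A), -1, 10)

# Weighted sum over the last axis gives one integer per chunk: shape (1000, 50)
codes = pd.DataFrame(blocks @ twos, index=A.index)
```

Each cell of codes is a value between 0 and 1023, so finding the rows of A whose chunk j matches a given code c reduces to codes.index[codes[j] == c].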


4 answers

Answer 1
4 2016-07-08 14:51:17

Answer 2
3 (accepted) 2016-07-08 14:09:48

Answer 3
1 2016-07-08 13:59:28

Answer 4
1 2016-07-08 15:17:41
