Quickest way to count occurrences of values in a multi-index pandas dataframe
I have two multi-index dataframes with many levels and columns. I'm looking for the quickest way to iterate over each dataframe and count, for each row, how many cells are above a specific value, and then find the intersection of the rows of the two dataframes that scored at least one count.

Right now I'm cycling through the dataframes with a combination of for loops and groupby, but it's taking too much time to find the right answer (my real dataframes comprise thousands of levels and hundreds of columns), so I need to find a different way to do this.

So for example:
idx = pd.MultiIndex.from_product([[0,1],[0,1,2]],
                                 names=['index_1','index_2'])
col = ['column_1', 'column_2']
values_list_a = [[1,2],[2,2],[2,1],[-8,1],[2,0],[2,1]]
DFA = pd.DataFrame(values_list_a, idx, col)
DFA:
                 column_1  column_2
index_1 index_2
0       0               1         2
        1               2         2
        2               2         1
1       0              -8         1
        1               2         0
        2               2         1
values_list_b = [[2,2],[0,1],[2,2],[2,2],[1,0],[1,2]]
DFB = pd.DataFrame(values_list_b, idx, col)

DFB:
                 column_1  column_2
index_1 index_2
0       0               2         2
        1               0         1
        2               2         2
1       0               2         2
        1               1         0
        2               1         2
What I expect is:

Step 1, counting occurrences:
DFA:
                 column_1  column_2  counts
index_1 index_2
0       0               1         2       1
        1               2         2       2
        2               2         1       1
1       0              -8         1       0
        1               2         0       1
        2               2         1       1
DFB:
                 column_1  column_2  counts
index_1 index_2
0       0               2         2       2
        1               0         1       0
        2               2         2       2
1       0               2         2       2
        1               1         0       0
        2               1         2       1
Step 2: the intersection of the two dataframes with counts > 0 should create a new dataframe like this (the rows of both dataframes that score at least one count at the same indices are kept, and a new index_0 level is added; index_0 = 0 should refer to DFA and index_0 = 1 to DFB):
DFC:
                         column_1  column_2  counts
index_0 index_1 index_2
0       0       0               1         2       1
                2               2         1       1
        1       2               2         1       1
1       0       0               2         2       2
                2               2         2       2
        1       2               1         2       1
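For reference, the Step 1 counting needs no loop at all; a minimal vectorized sketch on the question's DFA, assuming the threshold is 1:

```python
import pandas as pd

# Rebuild the example DataFrame from the question
idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']
values_list_a = [[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]]
DFA = pd.DataFrame(values_list_a, idx, col)

# Step 1: per-row count of cells strictly greater than the threshold
DFA['counts'] = DFA.gt(1).sum(axis=1)
print(DFA['counts'].tolist())  # [1, 2, 1, 0, 1, 1]
```

This single `gt`/`sum` pass replaces the per-row loop and scales to hundreds of columns.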
df.groupby(['index_0', 'index_1', 'index_2'])

Now you want the equivalent of SQL's HAVING, which is:

df.filter(lambda x: len(x.column_1) > 2)
df.count()

This is just a concept; I didn't understand exactly what you want to filter. Note that x is a group, so you need to operate on it (len, set, values, etc.).
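To make the HAVING-style idea above concrete, here is a runnable sketch on the question's DFA; the grouping level and the predicate (keep only groups in which every row has at least one value above the threshold) are illustrative choices, not something the question specified:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
df = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]],
                  idx, ['column_1', 'column_2'])

# SQL HAVING analogue: keep only the groups that satisfy a predicate.
# Here: every row of the group has at least one value > 1.
kept = (df.groupby(level='index_1')
          .filter(lambda g: (g > 1).any(axis=1).all()))
print(kept.index.tolist())  # [(0, 0), (0, 1), (0, 2)]
```

Group index_1=0 survives (each of its rows has a value above 1); group index_1=1 is dropped because the row (1, 0) = [-8, 1] fails the predicate.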
Use filter, .any() and pd.merge().

Recreate the dataframes:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[0,1],[0,1,2]], names=['one', 'two'])
columns = ['columns_1', 'columns_2']
DFA = pd.DataFrame(np.random.randint(-1, 3, size=[6,2]), idx, columns)
DFB = pd.DataFrame(np.random.randint(-1, 3, size=[6,2]), idx, columns)
print(DFA)
columns_1 columns_2
one two
0 0 -1 2
1 2 -1
2 -1 0
1 0 1 2
1 0 0
2 -1 -1
print(DFB)
columns_1 columns_2
one two
0 0 2 -1
1 1 2
2 2 1
1 0 0 0
1 -1 2
2 1 -1
Filter the dataframes for values > 1, in this instance.
DFA = DFA.loc[(DFA > 1).any(axis=1), :]
DFB = DFB.loc[(DFB > 1).any(axis=1), :]
print(DFA)
columns_1 columns_2
one two
0 0 -1 2
1 2 -1
1 0 1 2
print(DFB)
columns_1 columns_2
one two
0 0 2 -1
1 1 2
2 2 1
1 1 -1 2
Merge the two together. Using an outer join gets you close:

DFA.merge(DFB, left_index=True, right_index=True, how='outer')

Not sure about adding the extra index_0 level, but in the result the _x columns come from DFA and the _y columns from DFB.
columns_1_x columns_2_x columns_1_y columns_2_y
one two
0 0 -1.0 2.0 2.0 -1.0
1 2.0 -1.0 1.0 2.0
1 0 1.0 2.0 NaN NaN
0 2 NaN NaN 2.0 1.0
1 1 NaN NaN -1.0 2.0
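To turn this into the intersection the question asks for in Step 2, an inner join keeps only the index pairs that survived the filter in both dataframes. A sketch using the question's fixed data (rather than the random data above) and threshold 1:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]], idx, col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]], idx, col)

# Keep rows with at least one value above the threshold
fa = DFA[(DFA > 1).any(axis=1)]
fb = DFB[(DFB > 1).any(axis=1)]

# Inner join on the index keeps only pairs present in both dataframes
both = fa.merge(fb, left_index=True, right_index=True,
                how='inner', suffixes=('_a', '_b'))
print(both.index.tolist())  # [(0, 0), (0, 2), (1, 2)]
```

The surviving index pairs match the expected DFC rows; what remains is reshaping the `_a`/`_b` columns back into the stacked `index_0` layout.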
pd.concat, then magic.
def f(d, thresh=1):
    c = d.gt(thresh).sum(1)
    mask = c.gt(0).groupby(level=[1, 2]).transform('all')
    return d.assign(counts=c)[mask]

pd.concat({'bar': DFA, 'foo': DFB}, names=['index_0']).pipe(f)

                         column_1  column_2  counts
index_0 index_1 index_2
bar     0       0               1         2       1
                2               2         1       1
        1       2               2         1       1
foo     0       0               2         2       2
                2               2         2       2
        1       2               1         2       1
def f(d, thresh=1):
    # count how many values are greater than the threshold `thresh` per row
    c = d.gt(thresh).sum(1)
    # find where `counts` are > 0 for both dataframes;
    # conveniently dropped into one dataframe so we can do
    # this nifty `groupby` trick
    mask = c.gt(0).groupby(level=[1, 2]).transform('all')
    #                                    \---------------/
    #             `transform` is key to broadcasting over the
    #             original index rather than collapsing over
    #             the index levels we grouped by
    # create a new column named `counts` and filter with the boolean mask
    return d.assign(counts=c)[mask]

# Use concat to smash the two dataframes together into one
pd.concat({'bar': DFA, 'foo': DFB}, names=['index_0']).pipe(f)
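To get exactly the `index_0 = 0/1` labels from the question instead of `'bar'/'foo'`, the same pipeline can be keyed with integers; a self-contained run on the question's data:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]],
                                 names=['index_1', 'index_2'])
col = ['column_1', 'column_2']
DFA = pd.DataFrame([[1, 2], [2, 2], [2, 1], [-8, 1], [2, 0], [2, 1]], idx, col)
DFB = pd.DataFrame([[2, 2], [0, 1], [2, 2], [2, 2], [1, 0], [1, 2]], idx, col)

def f(d, thresh=1):
    # per-row count of values above the threshold
    c = d.gt(thresh).sum(axis=1)
    # keep (index_1, index_2) pairs whose count is > 0 under every index_0 key
    mask = c.gt(0).groupby(level=['index_1', 'index_2']).transform('all')
    return d.assign(counts=c)[mask]

DFC = pd.concat({0: DFA, 1: DFB}, names=['index_0']).pipe(f)
print(DFC['counts'].tolist())  # [1, 1, 1, 2, 2, 1]
```

The `counts` column matches the expected DFC exactly, with `index_0` as the new outer level.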