
Pandas: Apply mask to multiindex dataframe

I have a pandas dataframe with MultiIndex columns, with 3 levels:

import itertools
import numpy as np
import pandas as pd

def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]


miindex = pd.MultiIndex.from_product([mklbl('A', 4)])

micolumns = pd.MultiIndex.from_tuples(list(itertools.product(['A', 'B'], ['a', 'b', 'c'], ['foo', 'bar'])),
                                      names=['lvl0', 'lvl1', 'lvl2'])

dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns)).reshape((len(miindex), len(micolumns))),
                    index=miindex,
                    columns=micolumns).sort_index().sort_index(axis=1)

lvl0   A                       B                    
lvl1   a       b       c       a       b       c    
lvl2 bar foo bar foo bar foo bar foo bar foo bar foo
A0     1   0   3   2   5   4   7   6   9   8  11  10
A1    13  12  15  14  17  16  19  18  21  20  23  22
A2    25  24  27  26  29  28  31  30  33  32  35  34
A3    37  36  39  38  41  40  43  42  45  44  47  46

I want to mask this dataframe based on another dataframe whose columns carry only the last two levels of the column index:

cols = micolumns.droplevel(0).unique()
a_mask = pd.DataFrame(np.random.randn(len(dfmi.index), len(cols)), index=dfmi.index, columns=cols)
a_mask = (np.sign(a_mask) > 0).astype(bool)

        a             b             c       
      foo    bar    foo    bar    foo    bar
A0  False  False  False   True   True  False
A1   True  False   True  False   True   True
A2   True   True   True   True  False  False
A3   True  False  False   True   True  False

What I would like to do is mask the original dataframe according to a_mask. Let's say I want to set the original entries to zero where a_mask is True.

I tried to use pd.IndexSlice, but it fails silently (i.e. I can run the following code, but it has no effect):

dfmi.loc[:, pd.IndexSlice[:, a_mask]] = 0  #dfmi is unchanged

Any suggestions on how to achieve this?

Edit: In my use case, the labels are constructed with a Cartesian product, so all combinations of (lvl0, lvl1, lvl2) are present. However, lvl0 can take 2 values {A, B}, while lvl1 can take 3 values {a, b, c}.

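I think this way is safer. Note that where keeps the entries where the condition is True and replaces the rest, so the output below zeroes the entries where a_mask is False; to zero the entries where a_mask is True, use mask in place of where.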

dfmi.where(a_mask.loc[:,dfmi.columns.droplevel(0)].values,0)
Out[191]: 
lvl0   A               B            
lvl1   a       b       a       b    
lvl2 bar foo bar foo bar foo bar foo
A0     0   0   0   2   0   0   0   6
A1     9   8  11   0  13  12  15   0
A2     0  16  19  18   0  20  23  22
A3    25   0   0   0  29   0   0   0
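
Since where and mask are easy to mix up here, the following is a minimal sketch of the difference on a small frame. The names df, cond, kept and zeroed are made up for illustration and do not come from the question.

```python
import pandas as pd

# Hypothetical toy data, not the frame from the question.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
cond = pd.DataFrame({"x": [True, False, True], "y": [False, False, True]})

# where(): keep entries where cond is True, replace the others with 0.
kept = df.where(cond, 0)

# mask(): replace entries where cond is True with 0, keep the others.
zeroed = df.mask(cond, 0)
```

For the goal stated in the question (zero where the mask is True), the second form, or where with a negated condition, is the one that applies.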

I would do it as follows:

mask = pd.concat({k: a_mask for k in dfmi.columns.levels[0]}, axis=1)
dfmi.where(~mask, 0)
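
A self-contained sketch of this approach follows, so it can be run end to end. The seed, the plain string index, the names=["lvl0"] argument and the extra reindex step are my own assumptions, added so that the mask's column labels and order match dfmi exactly; they are not part of the answer above.

```python
import itertools

import numpy as np
import pandas as pd

# Frame shaped like the one in the question: 3-level columns, sorted.
columns = pd.MultiIndex.from_tuples(
    list(itertools.product(["A", "B"], ["a", "b", "c"], ["foo", "bar"])),
    names=["lvl0", "lvl1", "lvl2"],
)
index = ["A0", "A1", "A2", "A3"]
dfmi = pd.DataFrame(
    np.arange(len(index) * len(columns)).reshape(len(index), len(columns)),
    index=index,
    columns=columns,
).sort_index(axis=1)

# Boolean mask over (lvl1, lvl2); the seed is arbitrary, for repeatability.
rng = np.random.default_rng(0)
mask_cols = columns.droplevel(0).unique()
a_mask = pd.DataFrame(
    rng.standard_normal((len(index), len(mask_cols))) > 0,
    index=index,
    columns=mask_cols,
)

# Repeat the mask under every lvl0 label; names=["lvl0"] labels the new level.
mask = pd.concat(
    {k: a_mask for k in dfmi.columns.levels[0]}, axis=1, names=["lvl0"]
)
# Put the mask columns in the same order as dfmi's columns.
mask = mask.reindex(columns=dfmi.columns)

# where() keeps entries where the condition is True, so negate the mask
# to zero the entries where a_mask is True.
result = dfmi.where(~mask, 0)
```

dfmi itself is left unchanged; assign result back to dfmi if the masked frame should replace it.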

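Working with the underlying array data gives an in-place edit that is memory efficient (it doesn't create another dataframe). Note that the mask is applied by position, so the columns of a_mask must be in the same order as the columns within each lvl0 block of dfmi. On pandas versions with Copy-on-Write enabled, dfmi.values may be read-only, in which case the assignment below will not work -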

d = len(dfmi.columns.levels[0])
n = dfmi.shape[1]//d
for i in range(0,d*n,n):
    dfmi.values[:,i:i+n][a_mask] = 0

Sample run -

In [833]: dfmi
Out[833]: 
lvl0   A                       B                    
lvl1   a       b       c       a       b       c    
lvl2 bar foo bar foo bar foo bar foo bar foo bar foo
A0     1   0   3   2   5   4   7   6   9   8  11  10
A1    13  12  15  14  17  16  19  18  21  20  23  22
A2    25  24  27  26  29  28  31  30  33  32  35  34
A3    37  36  39  38  41  40  43  42  45  44  47  46

In [834]: a_mask
Out[834]: 
        a             b             c       
      foo    bar    foo    bar    foo    bar
A0   True   True   True  False  False  False
A1  False   True  False  False   True  False
A2  False   True   True   True  False  False
A3  False  False  False  False  False   True

In [835]: d = len(dfmi.columns.levels[0])
     ...: n = dfmi.shape[1]//d
     ...: for i in range(0,d*n,n):
     ...:     dfmi.values[:,i:i+n][a_mask] = 0

In [836]: dfmi
Out[836]: 
lvl0   A                       B                    
lvl1   a       b       c       a       b       c    
lvl2 bar foo bar foo bar foo bar foo bar foo bar foo
A0     0   0   0   2   5   4   0   0   0   8  11  10
A1    13   0  15  14   0  16  19   0  21  20   0  22
A2    25   0   0   0  29  28  31   0   0   0  35  34
A3    37  36  39  38  41   0  43  42  45  44  47   0
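
In the sample run above, each lvl0 block of dfmi is ordered bar, foo, while a_mask is ordered foo, bar, so the positional loop pairs columns with different lvl2 labels. A hedged sketch of a label-aligned variant is shown below. It assumes every lvl0 block holds the same (lvl1, lvl2) labels in the same order, and it works on a copy of the array instead of editing in place. The seed and the names block_cols, aligned and full_mask are illustrative only.

```python
import itertools

import numpy as np
import pandas as pd

# Frame and mask shaped like the ones in the question.
columns = pd.MultiIndex.from_tuples(
    list(itertools.product(["A", "B"], ["a", "b", "c"], ["foo", "bar"])),
    names=["lvl0", "lvl1", "lvl2"],
)
index = ["A0", "A1", "A2", "A3"]
dfmi = pd.DataFrame(
    np.arange(len(index) * len(columns)).reshape(len(index), len(columns)),
    index=index,
    columns=columns,
).sort_index(axis=1)

rng = np.random.default_rng(1)  # arbitrary seed, for repeatability
mask_cols = columns.droplevel(0).unique()
a_mask = pd.DataFrame(
    rng.standard_normal((len(index), len(mask_cols))) > 0,
    index=index,
    columns=mask_cols,
)

# (lvl1, lvl2) labels of the first lvl0 block, in dfmi's column order.
n_blocks = len(dfmi.columns.levels[0])
block_cols = dfmi.columns.droplevel(0)[: dfmi.shape[1] // n_blocks]

# Reorder the mask by label, then repeat it once per lvl0 block.
aligned = a_mask.reindex(columns=block_cols).to_numpy()
full_mask = np.tile(aligned, (1, n_blocks))

# Zero the masked cells on a copy of the data and rebuild the frame.
values = dfmi.to_numpy(copy=True)
values[full_mask] = 0
result = pd.DataFrame(values, index=dfmi.index, columns=dfmi.columns)
```

The copy costs extra memory compared with the loop above, but the result does not depend on how the mask's columns happen to be ordered.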

Updated solution, more robust because it does not hardcode the level values:

lvl0_values = dfmi.columns.get_level_values(0).unique()
pd.concat([dfmi[i].mask(a_mask.rename_axis(['lvl1','lvl2'],axis=1),0) for i in lvl0_values],
          keys=lvl0_values, axis=1)

Output:

lvl0   A               B            
lvl1   a       b       a       b    
lvl2 bar foo bar foo bar foo bar foo
A0     1   0   0   0   5   0   0   0
A1     9   0  11   0  13   0  15   0
A2    17  16  19   0  21  20  23   0
A3     0  24   0  26   0  28   0  30

One way you can do this:

pd.concat([dfmi['A'].mask(a_mask.rename_axis(['lvl1','lvl2'],axis=1),0),
           dfmi['B'].mask(a_mask.rename_axis(['lvl1','lvl2'],axis=1),0)],
           keys=['A','B'], axis=1)

print(a_mask)

lvl1      a             b       
lvl2    foo    bar    foo    bar
A0     True  False   True   True
A1     True  False   True  False
A2    False  False   True  False
A3    False   True  False   True

Output:

       A               B            
lvl1   a       b       a       b    
lvl2 bar foo bar foo bar foo bar foo
A0     1   0   0   0   5   0   0   0
A1     9   0  11   0  13   0  15   0
A2    17  16  19   0  21  20  23   0
A3     0  24   0  26   0  28   0  30
