简体   繁体   中英

Drop rows from a slice of Multi-Index DataFrame based on boolean

EDIT: Upon request I provide an example that is closer to the real data I am working with.

So I have a table data that looks something like

            value0    value1    value2
run step                              
0   0      0.12573 -0.132105  0.640423
    1       0.1049 -0.535669  0.361595
    2        1.304  0.947081 -0.703735
    3    -1.265421 -0.623274  0.041326
    4    -2.325031 -0.218792 -1.245911
    5    -0.732267 -0.544259   -0.3163
1   0     0.411631  1.042513 -0.128535
    1     1.366463 -0.665195   0.35151
    2      0.90347  0.094012 -0.743499
    3    -0.921725 -0.457726  0.220195
    4    -1.009618 -0.209176 -0.159225
    5     0.540846  0.214659  0.355373

(think: collection of time series) and a second table valid_range

    start   stop
run         
0   1   3
1   2   5

For each run I want to drop all rows that do not satisfy start≤step≤stop .

I tried the following (table generating code at the end)

for idx in valid_range.index:
    slc = data.loc[idx]
    start, stop = valid_range.loc[idx]
    cond = (start <= slc.index) & (slc.index <= stop)
    data.loc[idx] = data.loc[idx][cond]

However, this results in:

         value0 value1 value2
run step                     
0   0       NaN    NaN    NaN
    1       NaN    NaN    NaN
    2       NaN    NaN    NaN
    3       NaN    NaN    NaN
    4       NaN    NaN    NaN
    5       NaN    NaN    NaN
1   0       NaN    NaN    NaN
    1       NaN    NaN    NaN
    2       NaN    NaN    NaN
    3       NaN    NaN    NaN
    4       NaN    NaN    NaN
    5       NaN    NaN    NaN

I also tried data.loc[idx].drop(slc[cond].index, inplace=True) but it didn't have any effect...


Generating code for table

import numpy as np
from pandas import DataFrame, MultiIndex, Index
rng = np.random.default_rng(0)

valid_range = DataFrame({"start": [1, 2], "stop":[3, 5]}, index=Index(range(2), name="run"))
midx = MultiIndex(levels=[[],[]], codes=[[],[]], names=["run", "step"])
data = DataFrame(columns=[f"value{k}" for k in range(3)], index=midx)

for run in range(2):
    for step in range(6):
        data.loc[(run, step), :] = rng.normal(size=(3))
)

First, merge data and valid range based on 'run', using the merge method

>>> data

            value0     value1    value2
run step                               
0   0      0.12573  -0.132105  0.640423
    1       0.1049  -0.535669  0.361595
    2        1.304   0.947081 -0.703735
    3     -1.26542  -0.623274  0.041326
    4     -2.32503  -0.218792  -1.24591
    5    -0.732267  -0.544259   -0.3163
1   0     0.411631    1.04251 -0.128535
    1      1.36646  -0.665195   0.35151
    2      0.90347  0.0940123 -0.743499
    3    -0.921725  -0.457726  0.220195
    4     -1.00962  -0.209176 -0.159225
    5     0.540846   0.214659  0.355373


>>> valid_range

     start  stop
run             
0        1     3
1        2     5

>>> merged = data.reset_index().merge(valid_range, how='left', on='run')
>>> merged 

    run  step    value0     value1    value2  start  stop
0     0     0   0.12573  -0.132105  0.640423      1     3
1     0     1    0.1049  -0.535669  0.361595      1     3
2     0     2     1.304   0.947081 -0.703735      1     3
3     0     3  -1.26542  -0.623274  0.041326      1     3
4     0     4  -2.32503  -0.218792  -1.24591      1     3
5     0     5 -0.732267  -0.544259   -0.3163      1     3
6     1     0  0.411631    1.04251 -0.128535      2     5
7     1     1   1.36646  -0.665195   0.35151      2     5
8     1     2   0.90347  0.0940123 -0.743499      2     5
9     1     3 -0.921725  -0.457726  0.220195      2     5
10    1     4  -1.00962  -0.209176 -0.159225      2     5
11    1     5  0.540846   0.214659  0.355373      2     5

Then select the rows which satisfy the condition using eval . Use the boolean array to mask data

>>> cond = merged.eval('start < step < stop').to_numpy()
>>> data[cond]

            value0    value1    value2
run step                              
0   2        1.304  0.947081 -0.703735
1   3    -0.921725 -0.457726  0.220195
    4     -1.00962 -0.209176 -0.159225

Or if you want, here is a similar approach using query

res = (
    data.reset_index()
        .merge(valid_range, on='run', how='left')
        .query('start < step < stop')
        .drop(columns=['start','stop'])
        .set_index(['run', 'step'])
)

I would go on groupby like this:

(df.groupby(level=0)
   .apply(lambda x: x[x['small']>1])
   .reset_index(level=0, drop=True)    # remove duplicate index
)

which gives:

                           big  small
animal animal attribute              
cow    cow    speed       30.0   20.0
              weight     250.0  150.0
falcon falcon speed      320.0  250.0
lama   lama   speed       45.0   30.0
              weight     200.0  100.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM