
Filter MultiIndex with Query Strings

I have a fairly large DataFrame, say 600 indexes, and want to use filter criteria to produce a reduced version of the DataFrame containing only the entries where the criteria are true. From the research I've done, filtering works well when you're applying expressions to the data and already know the index you're operating on. What I want to do, however, is apply the filtering criteria to the index itself. See the example below.

The MultiIndex values are in bold; the names of the MultiIndex levels are in italics.

[Image: example DataFrame]

I'd like to apply the criteria as follows, or something along these lines:

df = df[MultiIndex.query('base == 115 & Al.isin(stn)')]

Then maybe do something like this:

df = df.transpose()[MultiIndex.query('Fault.isin(cont)')].transpose()

To result in:

[Image: expected filtered result]

I think fundamentally I'm trying to produce a boolean list to mask the MultiIndex. If there is a quick way to apply a pandas query to a 2D list, that would be acceptable. As of now, one option seems to be to convert the MultiIndex to a DataFrame and then apply whatever filtering I want to get the True/False array. I'm concerned that this will be slow, though.

As you noticed, indexes aren't great for querying using filter expressions. There's df.filter() but it doesn't really seem to work well on a MultiIndex.

You can still filter the MultiIndex values as an iterable of Python tuples, and then use .loc to access the filtered results.

This works:

rows = [(season, cont)
        for (season, cont) in df.index
        if 'Fault' in cont]
cols = [(stn, base)
        for (stn, base) in df.columns
        if base == 115 and 'Al' in stn]
df.loc[rows, cols]
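To try this out, here is a minimal sketch on a small made-up frame. The data, level names, and values are assumptions chosen to resemble the frame in the question, not the asker's real data. It also shows the same selection written as boolean masks built with get_level_values, which is closer to the "boolean list" idea from the question.

```python
import pandas as pd

# Hypothetical sample data; level names and values are assumptions.
index = pd.MultiIndex.from_product(
    [["Summer", "Winter"], ["Fault", "Trip"]], names=["season", "cont"]
)
columns = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)], names=["stn", "base"]
)
df = pd.DataFrame(
    [[1.0, 0.8, 0.7], [1.2, 0.9, 0.6], [0.7, 0.5, 0.4], [1.1, 1.0, 0.3]],
    index=index,
    columns=columns,
)

# Label-based selection: filter the index tuples, then pass them to .loc.
rows = [(season, cont) for (season, cont) in df.index if "Fault" in cont]
cols = [(stn, base) for (stn, base) in df.columns if base == 115 and "Al" in stn]
by_labels = df.loc[rows, cols]

# Mask-based selection: one boolean array per axis, built level by level.
row_mask = df.index.get_level_values("cont").str.contains("Fault")
col_mask = (df.columns.get_level_values("base") == 115) & (
    df.columns.get_level_values("stn").str.contains("Al")
)
by_masks = df.loc[row_mask, col_mask]

print(by_labels)
print(by_masks)
```

Both selections should keep the two "Fault" rows and the single ("Alpha", 115) column. The mask version avoids a Python-level loop over the tuples, which may matter on much larger indexes.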

If what you're after is the nifty df.query() syntax for slicing your data, then you're better off "unpivoting" your DataFrame, turning all index and column labels into regular fields.

You can create an unpivoted DataFrame with:

df_unpivot = df.stack(level=[0, 1]).rename('value').reset_index()

Which will produce a DataFrame that looks like this:

  season cont  stn   base value
0 Summer Fault Alpha  115   1.0
1 Summer Fault Beta   115   0.8
2 Summer Fault Gamma  230   0.7
3 Summer Trip  Alpha  115   1.2
4 Summer Trip  Beta   115   0.9
...

Which you can then query with:

df_unpivot.query(
    'cont.str.contains("Fault") and '
    'stn.str.contains("Al") and '
    'base == 115'
)

Which produces:

  season cont  stn   base value
0 Summer Fault Alpha  115   1.0
6 Winter Fault Alpha  115   0.7

Which are the two values you were expecting.
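If you would rather keep the original wide shape and still write the criteria as query strings, the idea from the question (convert each MultiIndex to a DataFrame and query that) can be sketched as below. The sample data is made up for illustration, and engine="python" is a conservative assumption in case the default engine rejects string methods on your pandas version.

```python
import pandas as pd

# Hypothetical sample data; level names and values are assumptions.
index = pd.MultiIndex.from_product(
    [["Summer", "Winter"], ["Fault", "Trip"]], names=["season", "cont"]
)
columns = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)], names=["stn", "base"]
)
df = pd.DataFrame(
    [[1.0, 0.8, 0.7], [1.2, 0.9, 0.6], [0.7, 0.5, 0.4], [1.1, 1.0, 0.3]],
    index=index,
    columns=columns,
)

# Each MultiIndex becomes a small frame with one column per level.
# index=False gives a plain 0..n-1 index, so the labels that survive
# the query are exactly the positions to keep.
row_frame = df.index.to_frame(index=False)
col_frame = df.columns.to_frame(index=False)

row_pos = row_frame.query(
    'cont.str.contains("Fault")', engine="python"
).index.to_numpy()
col_pos = col_frame.query(
    'base == 115 and stn.str.contains("Al")', engine="python"
).index.to_numpy()

# Positional selection keeps the original MultiIndex on both axes.
result = df.iloc[row_pos, col_pos]
print(result)
```

The helper frames hold only one row per index label, so for a few hundred labels the conversion should be cheap compared with unpivoting the whole DataFrame.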
