
How to slice a pandas DataFrame based on a subset of the levels in a MultiIndex

I commonly work with large DataFrames that have an index with many levels, which I want to slice based on a subset of the levels. I do not think there is a straightforward way to do this. In particular, pandas.IndexSlice does NOT provide the desired results, as I'll explain below.

Let's say we have a DataFrame like this:

                      col0  col1
level0 level1 level2            
0      0      0          0     0
              1          1     1
       1      0          2     2
              1          3     3
1      0      0          4     4
              1          5     5
       1      0          6     6
              1          7     7

I wish that I could slice it like this:

# This doesn't work!
df.loc[[
    (0, 1), 
    (1, 0),
    ]]
# ValueError: operands could not be broadcast together with shapes (2,2) (3,) (2,2)

The desired result is this:

                      col0  col1
level0 level1 level2            
0      1      0          2     2
              1          3     3
1      0      0          4     4
              1          5     5

IndexSlice does something different, NOT what is desired here:

df.loc[pandas.IndexSlice[[0, 1], [1, 0], :]]

It returns every combination of the values listed for each level, here (0, 0), (0, 1), (1, 0) and (1, 1), rather than only the requested pairs (0, 1) and (1, 0).
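To make the difference concrete, here is a minimal sketch that rebuilds the example frame and collects the (level0, level1) pairs that the IndexSlice call actually returns. The names `res` and `pairs` are illustrative, and the trailing `, :` is only there to select all columns explicitly.

```python
import numpy as np
import pandas

# Rebuild the 3-level example frame from the question
midx = pandas.MultiIndex.from_product(
    [range(2)] * 3, names=['level0', 'level1', 'level2'])
df = pandas.DataFrame(
    np.transpose([np.arange(len(midx))] * 2),
    columns=['col0', 'col1'], index=midx)

# Each list is matched against its own level independently
res = df.loc[pandas.IndexSlice[[0, 1], [1, 0], :], :]

# Distinct (level0, level1) pairs present in the result
pairs = set(res.index.droplevel('level2'))
print(len(res), len(pairs))
```

On this small frame the four combinations cover every row, so the "slice" hands back the whole DataFrame instead of the four rows that were wanted.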

I'm going to post my own answer with some workarounds that I've figured out, but none are perfect, so please post any other ideas.

Here is the code that generates the data:

import pandas
import numpy as np

# Size of the problem
n_levels = 3
n_values_per_level = 2

# Build an example MultiIndex
midx = pandas.MultiIndex.from_product(
    [range(n_values_per_level)] * n_levels,
    names=['level{}'.format(level) for level in range(n_levels)]
)

# Generate data of the appropriate number of rows
df = pandas.DataFrame(
    np.transpose([np.arange(len(midx))] * 2), 
    columns=['col0', 'col1'],
    index=midx)

Boolean indexing

It seems that boolean indexing automatically supports selections based on a subset of the levels, so another option is to convert the desired indices to a boolean mask.

# The (level0, level1) pairs to keep
slicing_midx = pandas.MultiIndex.from_tuples([(0, 1), (1, 0)],
    names=['level0', 'level1'])

# True for the wanted pairs, False for every other pair present in df
select = pandas.Series(True, index=slicing_midx).reindex(
    df.index.droplevel(df.index.names.difference(slicing_midx.names)).unique(),
    fill_value=False
)

res = df.loc[select]
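Another way to get a boolean mask, sketched here as an alternative rather than as part of the answer above, is to drop the levels that are not being sliced on and test membership with MultiIndex.isin. It assumes the slicing levels appear in the same order as in df; the names `drop` and `mask` are illustrative.

```python
import numpy as np
import pandas

# Rebuild the 3-level example frame from the question
midx = pandas.MultiIndex.from_product(
    [range(2)] * 3, names=['level0', 'level1', 'level2'])
df = pandas.DataFrame(
    np.transpose([np.arange(len(midx))] * 2),
    columns=['col0', 'col1'], index=midx)

slicing_midx = pandas.MultiIndex.from_tuples(
    [(0, 1), (1, 0)], names=['level0', 'level1'])

# Levels of df that are not being sliced on
drop = [name for name in df.index.names
        if name not in slicing_midx.names]

# Boolean array: True where the (level0, level1) part of a row is wanted
mask = df.index.droplevel(drop).isin(list(slicing_midx))

res = df[mask]
```

Because the mask is a plain NumPy array with one entry per row of df, no index alignment is involved and the original row order and level names are kept.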

Here are a few workarounds I've found, none perfect:

Unstacking

slicing_midx = pandas.MultiIndex.from_tuples([(0, 1), (1, 0)], 
    names=['level0', 'level1'])
res = df.unstack('level2').loc[slicing_midx].stack('level2')

This works. The downside is that it creates an intermediate data structure that can be extremely large. In the worst case (when level2 contains no duplicate values), the intermediate structure is roughly the square of the size of the original.
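The size blow-up can be seen on a tiny frame. The sketch below uses a hypothetical worst-case frame (named `worst` here) in which every row has a distinct value in the innermost level, level2 in the example frame, and compares shapes before and after unstacking.

```python
import numpy as np
import pandas

# Hypothetical worst case: every row has a distinct level2 value
worst_idx = pandas.MultiIndex.from_tuples(
    [(0, 0, 0), (0, 1, 1), (1, 0, 2), (1, 1, 3)],
    names=['level0', 'level1', 'level2'])
worst = pandas.DataFrame(
    {'col0': np.arange(4), 'col1': np.arange(4)}, index=worst_idx)

# One output column per (original column, level2 value) pair
wide = worst.unstack('level2')

print(worst.shape)  # (4, 2)
print(wide.shape)   # (4, 8)

# Only the original 8 cells hold data; the rest is NaN padding
filled = int(wide.notna().sum().sum())
```

The row count stays at 4 while the column count is multiplied by the number of distinct level2 values, so the intermediate frame grows with the square of the row count in this case.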

Resetting index

This solution was proposed by @anky_91. Reset the index into the data columns, then append to index again after slicing.

# The levels to slice on, in sorted order
slicing_levels = list(slicing_midx.names)

# The levels not to slice on
non_slicing_levels = [level for level in df.index.names 
    if level not in slicing_levels]

# Reset the unneeded index
res = df.reset_index(non_slicing_levels).loc[
    slicing_midx].set_index(non_slicing_levels, append=True)

This is pretty efficient. The only downside I can think of is that it might mess up a MultiIndex on the columns, if there is one (I have not checked this).

Index individually and concat

slicing_midx = pandas.MultiIndex.from_tuples([(0, 1), (1, 0)], 
    names=['level0', 'level1'])
res = pandas.concat([df.loc[idx] for idx in slicing_midx],
    keys=slicing_midx, names=slicing_midx.names)

This works, but it can be very slow for large DataFrames because each entry of slicing_midx must be indexed individually. It also drops the level names, for some reason.

It is probably the fastest option if len(slicing_midx) << len(df).

Merge/compare the MultiIndex

This compares the indexes with pandas.merge, masks, and slices. I believe this is the most efficient, but it's also cumbersome.

slicing_midx = pandas.MultiIndex.from_tuples([(0, 1), (1, 0)], 
    names=['level0', 'level1'])
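# Turn both indexes into plain DataFrames so they can be merged
df1 = slicing_midx.to_frame().reset_index(drop=True)
df2 = df.index.to_frame().reset_index(drop=True)
# Marker column: it is NaN after the merge for rows of df with no match
df1['key'] = 1
# The left merge yields one row per row of df, in the same order
# (assuming slicing_midx has no duplicate entries)
mask = ~pandas.merge(
    df2, df1, on=['level0', 'level1'], how='left')[
    'key'].isnull()
res = df.loc[mask.values]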
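The merge approach can be wrapped in a small function so the cumbersome part is written only once. This is a sketch: `select_by_levels` is a hypothetical helper, not a pandas function, and the `drop_duplicates` step is an added guard (not in the code above) so that repeated entries in slicing_midx cannot repeat rows of df in the merge.

```python
import numpy as np
import pandas


def select_by_levels(df, slicing_midx):
    """Keep rows of df whose index matches slicing_midx on its named levels.

    Hypothetical helper for illustration; not part of pandas.
    """
    on = list(slicing_midx.names)
    # Unique wanted keys, so the left merge keeps one row per row of df
    wanted = slicing_midx.to_frame(index=False).drop_duplicates()
    wanted['key'] = 1
    # The full index of df as ordinary columns
    have = df.index.to_frame(index=False)
    merged = pandas.merge(have, wanted, on=on, how='left')
    # Rows with no match have NaN in the marker column
    return df.loc[merged['key'].notnull().values]


# Rebuild the 3-level example frame from the question
midx = pandas.MultiIndex.from_product(
    [range(2)] * 3, names=['level0', 'level1', 'level2'])
df = pandas.DataFrame(
    np.transpose([np.arange(len(midx))] * 2),
    columns=['col0', 'col1'], index=midx)

# A duplicate entry is included on purpose to exercise the guard
slicing_midx = pandas.MultiIndex.from_tuples(
    [(0, 1), (1, 0), (0, 1)], names=['level0', 'level1'])

res = select_by_levels(df, slicing_midx)
```

The merge columns are taken from slicing_midx.names, so the same function should also work when slicing on a different subset of levels, as long as those names exist in df.index.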
