How to slice a pandas DataFrame based on a subset of the levels in a MultiIndex

I commonly work with large DataFrames whose index has many levels, and I want to slice them based on a subset of those levels. I do not think there is a straightforward way to do this. In particular, pandas.IndexSlice does NOT provide the desired result, as I'll explain below.

Let's say we have a DataFrame like this:

                      col0  col1
level0 level1 level2            
0      0      0          0     0
              1          1     1
       1      0          2     2
              1          3     3
1      0      0          4     4
              1          5     5
       1      0          6     6
              1          7     7

I wish that I could slice it like this:

# This doesn't work!
df.loc[[
    (0, 1), 
    (1, 0),
    ]]
# ValueError: operands could not be broadcast together with shapes (2,2) (3,) (2,2)

The desired result is this:

                      col0  col1
level0 level1 level2            
0      1      0          2     2
              1          3     3
1      0      0          4     4
              1          5     5

IndexSlice does something different, NOT what is desired here:

df.loc[pandas.IndexSlice[[0, 1], [1, 0], :]]

It gives all combinations of the requested level values, rather than just the requested (level0, level1) pairs.
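To make the difference concrete, here is a minimal sketch that rebuilds the 8-row example frame and applies the IndexSlice selection. The names crossed and n_pairs are only illustrative. Because each list is matched against its own level independently, every row of this small frame is selected, not just the rows for the pairs (0, 1) and (1, 0).

```python
import numpy as np
import pandas

# Rebuild the 8-row example frame from the question
midx = pandas.MultiIndex.from_product(
    [range(2)] * 3, names=['level0', 'level1', 'level2'])
df = pandas.DataFrame(
    np.transpose([np.arange(len(midx))] * 2),
    columns=['col0', 'col1'], index=midx)

# Each list is applied to its own level independently, so any
# combination of level0 in [0, 1] and level1 in [1, 0] matches
crossed = df.loc[pandas.IndexSlice[[0, 1], [1, 0], :], :]

# Count the distinct (level0, level1) pairs that came back
n_pairs = len(crossed.index.droplevel('level2').unique())
print(len(crossed), n_pairs)
```

Here the selection returns all 8 rows covering 4 distinct pairs, whereas the desired result has 4 rows covering 2 pairs.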

I'm going to post my own answer with some workarounds that I've figured out, but none are perfect, so please post any other ideas.

Here is the code that generates the data:

import pandas
import numpy as np

# Size of the problem
n_levels = 3
n_values_per_level = 2

# Build an example MultiIndex
midx = pandas.MultiIndex.from_product(
    [range(n_values_per_level)] * n_levels,
    names=['level{}'.format(level) for level in range(n_levels)]
)

# Generate data of the appropriate number of rows
df = pandas.DataFrame(
    np.transpose([np.arange(len(midx))] * 2), 
    columns=['col0', 'col1'],
    index=midx)

Boolean indexing

It seems that boolean indexing automatically supports selections based on a subset of the levels, so another option is to convert the desired indices to a boolean mask.

slicing_midx = pandas.MultiIndex.from_tuples([(0, 1), (1, 0)],
  names=['level0', 'level1']
)

select = pandas.Series(True, index=slicing_midx).reindex(
  df.index.droplevel(df.index.names.difference(slicing_midx.names)).unique(),
  fill_value=False
)

res = df.loc[select]
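A related way to build the mask, sketched here as an alternative that does not depend on index alignment, is to drop the levels that are not being sliced and test each remaining (level0, level1) pair for membership with MultiIndex.isin. The names wanted, mask and res_isin are illustrative, and the example assumes the 8-row frame from the question.

```python
import numpy as np
import pandas

# Rebuild the 8-row example frame from the question
midx = pandas.MultiIndex.from_product(
    [range(2)] * 3, names=['level0', 'level1', 'level2'])
df = pandas.DataFrame(
    np.transpose([np.arange(len(midx))] * 2),
    columns=['col0', 'col1'], index=midx)

# The (level0, level1) pairs to keep
wanted = [(0, 1), (1, 0)]

# Drop the level we are not slicing on, then test each row's
# remaining (level0, level1) pair for membership in `wanted`
mask = df.index.droplevel('level2').isin(wanted)

# `mask` is a plain boolean array with one entry per row of df
res_isin = df.loc[mask]
print(res_isin)
```

Note that the rows come back in the order they appear in df, not in the order of the pairs in wanted.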

Here are a few workarounds I've found, none perfect:

Unstacking

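slicing_midx = pandas.MultiIndex.from_tuples([(0, 1), (1, 0)], 
    names=['level0', 'level1'])
# Move the level that is not sliced on ('level2') into the columns,
# select the pairs, then move it back into the index
res = df.unstack('level2').loc[slicing_midx].stack('level2')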

This works. The downside is that it creates an intermediate data structure that can be extremely large. In the worst case (when level2 contains no duplicate values), the intermediate structure is roughly the square of the size of the original.
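The growth of the intermediate structure can be seen on a tiny frame. The sketch below uses a hypothetical 4-row frame, small, in which the unstacked level has no repeated values; it is not the frame from the question.

```python
import numpy as np
import pandas

# Hypothetical frame whose last index level has no repeated values
idx = pandas.MultiIndex.from_arrays(
    [[0, 0, 1, 1], [0, 1, 0, 1], [10, 11, 12, 13]],
    names=['level0', 'level1', 'level2'])
small = pandas.DataFrame(
    {'col0': np.arange(4), 'col1': np.arange(4)}, index=idx)

# Unstacking creates one column per (data column, level2 value) pair
wide = small.unstack('level2')
print(small.shape, wide.shape)

# Most cells of the intermediate frame are just NaN padding
print(int(wide.isna().sum().sum()))
```

The original has 4 x 2 = 8 cells, while the unstacked frame has 4 x 8 = 32 cells, 24 of which are NaN.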

Resetting index

This solution was proposed by @anky_91. Reset the non-slicing index levels into data columns, then append them to the index again after slicing.

# The levels to slice on, in sorted order
slicing_levels = list(slicing_midx.names)

# The levels not to slice on
non_slicing_levels = [level for level in df.index.names 
    if level not in slicing_levels]

# Reset the unneeded index
res = df.reset_index(non_slicing_levels).loc[
    slicing_midx].set_index(non_slicing_levels, append=True)

This is pretty efficient. The only downside I can think of is that it might mess up a MultiIndex on the columns, if there is one (I still need to check this).

Index individually and concat

slicing_midx = pandas.MultiIndex.from_tuples([(0, 1), (1, 0)], 
    names=['level0', 'level1'])
res = pandas.concat([df.loc[idx] for idx in slicing_midx],
    keys=slicing_midx, names=slicing_midx.names)

This works. But it can be very slow for large DataFrames because each element must be individually indexed. It also drops the level names, for some reason.

This is probably the fastest option if len(slicing_midx) << len(df).

Merge/compare the MultiIndex

This compares the two indexes with pandas.merge, builds a boolean mask from the result, and slices with that mask. I believe this is the most efficient, but it's also cumbersome.

slicing_midx = pandas.MultiIndex.from_tuples([(0, 1), (1, 0)], 
    names=['level0', 'level1'])
df1 = slicing_midx.to_frame().reset_index(drop=True)
df2 = df.index.to_frame().reset_index(drop=True)
df1['key'] = 1
mask = ~pandas.merge(
    df2, df1, on=['level0', 'level1'], how='left')[
    'key'].isnull()
res = df.loc[mask.values]
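A slightly shorter variant of the same idea, sketched here under the assumption that slicing_midx contains no duplicate pairs (duplicates would repeat rows in the merge and make the mask the wrong length), uses the indicator=True option of pandas.merge instead of a helper key column. The names left, right, merged and res_ind are illustrative.

```python
import numpy as np
import pandas

# Rebuild the 8-row example frame from the question
midx = pandas.MultiIndex.from_product(
    [range(2)] * 3, names=['level0', 'level1', 'level2'])
df = pandas.DataFrame(
    np.transpose([np.arange(len(midx))] * 2),
    columns=['col0', 'col1'], index=midx)

slicing_midx = pandas.MultiIndex.from_tuples(
    [(0, 1), (1, 0)], names=['level0', 'level1'])

# Turn both indexes into plain frames with a default integer index
left = df.index.to_frame(index=False)
right = slicing_midx.to_frame(index=False)

# A left merge keeps one output row per row of df, in df's order;
# the '_merge' column records whether a matching pair was found
merged = pandas.merge(left, right, on=['level0', 'level1'],
                      how='left', indicator=True)
mask = (merged['_merge'] == 'both').to_numpy()

res_ind = df.loc[mask]
print(res_ind)
```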
