Selecting columns from a pandas MultiIndex

Question

I have a DataFrame with MultiIndex columns that looks like this:

# sample data
import numpy as np
import pandas as pd
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data


What is the proper, simple way of selecting only specific columns (e.g. ['a', 'c'], not a range) from the second level?

Currently I am doing it like this:

import itertools
tuples = [i for i in itertools.product(['one', 'two'], ['a', 'c'])]
new_index = pd.MultiIndex.from_tuples(tuples)
print(new_index)
data.reindex_axis(new_index, axis=1)


It doesn't feel like a good solution, however, because I have to bust out itertools, build another MultiIndex by hand and then reindex (and my actual code is even messier, since the column lists aren't so simple to fetch). I am pretty sure there has to be some ix or xs way of doing this, but everything I tried resulted in errors.

Answer 1

The most straightforward way is with .loc:

>>> data.loc[:, (['one', 'two'], ['a', 'b'])]


   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

Remember that [] and () have special meaning when dealing with a MultiIndex object:

(...) a tuple is interpreted as one multi-level key (...)

(...) a list is used to specify several keys [on the same level] (...)

(...) a tuple of lists refer to several values within a level (...)

When we write (['one', 'two'], ['a', 'b']), the first list inside the tuple specifies all the values we want from the 1st level of the MultiIndex. The second list inside the tuple specifies all the values we want from the 2nd level of the MultiIndex.
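To make the list/tuple distinction concrete, here is a minimal sketch using the sample frame from the question. The variable names subset and single are only for illustration.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# A tuple of lists: one list per level, every combination is selected.
subset = data.loc[:, (['one', 'two'], ['a', 'c'])]
print(subset.columns.tolist())

# A tuple of scalars is a single multi-level key, so one column comes back.
single = data.loc[:, ('one', 'a')]
print(type(single).__name__)
```

The first selection keeps a two-level column index, while the second one returns a plain Series.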

Edit 1: Another possibility is to use slice(None) to specify that we want anything from the first level (this works similarly to slicing with : in lists), and then specify which columns we want from the second level.

>>> data.loc[:, (slice(None), ["a", "b"])]

   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

If the slice(None) syntax does not appeal to you, another possibility is to use pd.IndexSlice, which helps when slicing frames with more elaborate indices.

>>> data.loc[:, pd.IndexSlice[:, ["a", "b"]]]

   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

When using pd.IndexSlice, we can use : as usual to slice the frame.

Source: MultiIndex / Advanced Indexing, How to use slice(None)
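As a quick check that the two spellings are interchangeable, here is a small sketch. It uses integer data instead of random numbers so the comparison is deterministic, and the label range on the second level assumes the columns are lexsorted, as they are in the sample frame.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

idx = pd.IndexSlice

# slice(None) and ":" inside IndexSlice mean the same thing: take the whole level.
with_slice = data.loc[:, (slice(None), ['a', 'b'])]
with_idx = data.loc[:, idx[:, ['a', 'b']]]
print(with_slice.equals(with_idx))

# IndexSlice also allows label ranges on a level (both ends inclusive).
ranged = data.loc[:, idx[:, 'a':'b']]
print(ranged.columns.tolist())
```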

Answer 2

It's not great, but maybe:

>>> data
        one                           two                    
          a         b         c         a         b         c
0 -0.927134 -1.204302  0.711426  0.854065 -0.608661  1.140052
1 -0.690745  0.517359 -0.631856  0.178464 -0.312543 -0.418541
2  1.086432  0.194193  0.808235 -0.418109  1.055057  1.886883
3 -0.373822 -0.012812  1.329105  1.774723 -2.229428 -0.617690
>>> data.loc[:,data.columns.get_level_values(1).isin({"a", "c"})]
        one                 two          
          a         c         a         c
0 -0.927134  0.711426  0.854065  1.140052
1 -0.690745 -0.631856  0.178464 -0.418541
2  1.086432  0.808235 -0.418109  1.886883
3 -0.373822  1.329105  1.774723 -0.617690

would work?

Answer 3

You can use either loc or ix; I'll show an example with loc:

data.loc[:, [('one', 'a'), ('one', 'c'), ('two', 'a'), ('two', 'c')]]

When you have a MultiIndexed DataFrame and you want to select only some of the columns, you have to pass a list of tuples that match those columns. So the itertools approach was pretty much OK, but you don't have to create a new MultiIndex:

data.loc[:, list(itertools.product(['one', 'two'], ['a', 'c']))]
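If the column lists are awkward to build with itertools.product, one alternative is to filter the tuples that already exist in data.columns. This is only a sketch; the names wanted, subset and reordered are made up for the example.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# Iterating over a MultiIndex yields one tuple per column, so the
# second-level label is key[1].
wanted = [key for key in data.columns if key[1] in {'a', 'c'}]
subset = data.loc[:, wanted]
print(subset.columns.tolist())

# A list of tuples is taken in the order given, so it can also reorder columns.
reordered = data.loc[:, [('two', 'c'), ('one', 'a')]]
print(reordered.columns.tolist())
```

Because the tuples come from the frame itself, this never asks for a combination that does not exist.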

Answer 4

I think there is a much better way (now), which is why I bother pulling this question (which was the top Google result) out of the shadows:

data.select(lambda x: x[1] in ['a', 'b'], axis=1)

gives your expected output in a quick and clean one-liner:

        one                 two          
          a         b         a         b
0 -0.341326  0.374504  0.534559  0.429019
1  0.272518  0.116542 -0.085850 -0.330562
2  1.982431 -0.420668 -0.444052  1.049747
3  0.162984 -0.898307  1.762208 -0.101360

It is mostly self-explanatory; the [1] refers to the level.
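DataFrame.select was later deprecated and then removed in pandas 1.0, so the one-liner above no longer runs on current versions. A rough equivalent of the same per-column predicate, sketched here with a callable passed to .loc, would be:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# The callable receives the frame and returns one boolean per column;
# key[1] is the second-level label, just like x[1] in the select version.
subset = data.loc[:, lambda df: [key[1] in ('a', 'b') for key in df.columns]]
print(subset.columns.tolist())
```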

Answer 5

`ix` and `select` are deprecated!

With pd.IndexSlice, loc is preferable to ix and select.

`DataFrame.loc` with `pd.IndexSlice`

# Setup
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame('x', index=range(4), columns=col)
data

  one       two      
    a  b  c   a  b  c
0   x  x  x   x  x  x
1   x  x  x   x  x  x
2   x  x  x   x  x  x
3   x  x  x   x  x  x

data.loc[:, pd.IndexSlice[:, ['a', 'c']]]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

Alternatively, you can pass an axis parameter to loc to make it explicit which axis you're indexing from:

data.loc(axis=1)[pd.IndexSlice[:, ['a', 'c']]]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

`MultiIndex.get_level_values`

Calling data.columns.get_level_values to filter with loc is another option:

data.loc[:, data.columns.get_level_values(1).isin(['a', 'c'])]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

This naturally allows filtering on any conditional expression on a single level. Here's a random example with lexicographical filtering:

data.loc[:, data.columns.get_level_values(1) > 'b']

  one two
    c   c
0   x   x
1   x   x
2   x   x
3   x   x
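The same idea extends to conditions on more than one level, because each get_level_values call gives one value per column and the resulting boolean arrays can be combined with & and |. A minimal sketch (lvl0, lvl1 and mask are names chosen for the example):

```python
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame('x', index=range(4), columns=col)

lvl0 = data.columns.get_level_values(0)
lvl1 = data.columns.get_level_values(1)

# Keep columns under 'two' whose second-level label is 'a' or 'c'.
mask = (lvl0 == 'two') & lvl1.isin(['a', 'c'])
subset = data.loc[:, mask]
print(subset.columns.tolist())
```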

More information on slicing and filtering MultiIndexes can be found at Select rows in pandas MultiIndex DataFrame.

Answer 6

To select all columns named 'a' and 'c' at the second level of your column indexer, you can use slicers:

>>> data.loc[:, (slice(None), ('a', 'c'))]

        one                 two          
          a         c         a         c
0 -0.983172 -2.495022 -0.967064  0.124740
1  0.282661 -0.729463 -0.864767  1.716009
2  0.942445  1.276769 -0.595756 -0.973924
3  2.182908 -0.267660  0.281916 -0.587835

Here you can read more about slicers.

Answer 7

A slightly easier, to my mind, riff on Marc P.'s answer using slice:

import numpy as np
import pandas as pd
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'], ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

data.loc[:, pd.IndexSlice[:, ['a', 'c']]]

        one                 two          
          a         c         a         c
0 -1.731008  0.718260 -1.088025 -1.489936
1 -0.681189  1.055909  1.825839  0.149438
2 -1.674623  0.769062  1.857317  0.756074
3  0.408313  1.291998  0.833145 -0.471879

As of pandas 0.21 or so, .select is deprecated in favour of .loc.

Answer 8

For an arbitrary level of the column value

If the level of the column index is arbitrary, this might help you a bit:

import itertools
import pandas as pd

class DataFrameMultiColumn(pd.DataFrame) :
    def loc_multicolumn(self, keys):
        depth = lambda L: isinstance(L, list) and max(map(depth, L))+1
        
        result = []
        col = self.columns
        
        # if depth of keys is 1, all keys need to be true
        if depth(keys) == 1:
            for c in col:
                # select all columns which contain all keys
                if set(keys).issubset(set(c)) : 
                    result.append(c)
        # depth of 2 indicates, 
        # the product of all sublists will be formed
        elif depth(keys) == 2 :
            keys = list(itertools.product(*keys)) 
            for c in col:
                for k in keys :
                    # select all columns which contain all keys
                    if set(k).issubset(set(c)) : 
                        result.append(c)
                        
        else :
            raise ValueError("Depth of the keys list exceeds 2")

        # return with .loc command
        return self.loc[:,result]

.loc_multicolumn will return the same as calling .loc, but without specifying the level for each key. Please note that this might be a problem if the same values appear in multiple column levels!

Example:

Sample data:

np.random.seed(1)
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randint(0, 10, (4,6)), columns=col)
data_mc = DataFrameMultiColumn(data)

>>> data_mc
      one       two      
        a  b  c   a  b  c
    0   5  8  9   5  0  0
    1   1  7  6   9  2  4
    2   5  2  4   2  4  7
    3   7  9  1   7  0  6

Cases:

List depth 1 requires every element in the list to match.

>>> data_mc.loc_multicolumn(['a', 'one'])
  one
    a
0   5
1   1
2   5
3   7
>>> data_mc.loc_multicolumn(['a', 'b'])
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

>>> data_mc.loc_multicolumn(['one','a', 'b'])
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

List depth 2 allows all elements of the Cartesian product of the keys list.

>>> data_mc.loc_multicolumn([['a', 'b']])
  one    two   
    a  b   a  b
0   5  8   5  0
1   1  7   9  2
2   5  2   2  4
3   7  9   7  0
    
>>> data_mc.loc_multicolumn([['one'],['a', 'b']])
  one   
    a  b
0   5  8
1   1  7
2   5  2
3   7  9

For the last case: every combination from list(itertools.product(["one"], ['a', 'b'])) is returned, provided all elements of the combination match.

Answer 9

Use df.loc(axis="columns") (or df.loc(axis=1)) to access just the columns and slice away:

df.loc(axis="columns")[:, ["a", "c"]]
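Here is a small runnable version of that call against the sample frame from the question (called data here rather than df), showing that the axis can be given by name or by number:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# With the axis fixed up front, the key inside [] is read as
# (first level, second level) for the columns only.
by_name = data.loc(axis="columns")[:, ["a", "c"]]
by_number = data.loc(axis=1)[:, ["a", "c"]]
print(by_name.columns.tolist())
print(by_name.equals(by_number))
```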

Answer 10

The .loc[:, list of column tuples] approach given in one of the earlier answers fails when the MultiIndex has boolean values, as in the example below:

col = pd.MultiIndex.from_arrays([[False, False, True,  True],
                                 [False, True,  False, True]])
data = pd.DataFrame(np.random.randn(4, 4), columns=col)
data.loc[:,[(False, True),(True, False)]]

This fails with a ValueError: PandasArray must be 1-dimensional.

Compare this to the following example, where the index values are strings and not booleans:

col = pd.MultiIndex.from_arrays([["False", "False", "True",  "True"],
                                 ["False", "True",  "False", "True"]])
data = pd.DataFrame(np.random.randn(4, 4), columns=col)
data.loc[:,[("False", "True"),("True", "False")]]

This works fine.

You can transform the first (boolean) scenario into the second (string) scenario with

data.columns = pd.MultiIndex.from_tuples([(str(i),str(j)) for i,j in data.columns],
    names=data.columns.names)

and then access with string instead of boolean column index values (the names=data.columns.names parameter is optional and not relevant to this example). This example has a two-level column index; if you have more levels, adjust this code correspondingly.

A boolean multi-level column index arises, for example, if one does a crosstab where the columns result from two or more comparisons.
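For a column index with more than two levels, one way to adjust the conversion is to stringify every element of each column tuple instead of unpacking exactly two values. This is a sketch of that variation; the level names 'first' and 'second' are made up for the example.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([[False, False, True, True],
                                 [False, True, False, True]],
                                names=['first', 'second'])
data = pd.DataFrame(np.arange(16).reshape(4, 4), columns=col)

# Convert each label in each column tuple to str, whatever the number of levels.
data.columns = pd.MultiIndex.from_tuples(
    [tuple(str(value) for value in key) for key in data.columns],
    names=data.columns.names)

subset = data.loc[:, [('False', 'True'), ('True', 'False')]]
print(subset.columns.tolist())
```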

Answer 11

There are two answers here, depending on the exact output you need.

If you want to get a single-level dataframe from your selection (which can sometimes be really useful), simply use:

df.xs('theColumnYouNeed', level=1, axis=1)

If you want to keep the MultiIndex form (similar to metakermit's answer):

data.loc[:, data.columns.get_level_values(1) == "columnName"]
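To see the difference between the two outputs side by side, here is a short sketch on the sample frame from the question. It also uses the drop_level=False option of xs, which is not mentioned above but gives the MultiIndex-preserving result from xs itself.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# xs drops the selected level by default, leaving a single-level column index.
flat = data.xs('a', level=1, axis=1)
print(flat.columns.tolist())

# drop_level=False keeps both levels, like the boolean-mask version.
kept = data.xs('a', level=1, axis=1, drop_level=False)
masked = data.loc[:, data.columns.get_level_values(1) == 'a']
print(kept.columns.tolist())
```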

Hope this will help someone.

Answer 12

Rename columns before selecting

Sample dataframe

import pandas as pd
import numpy as np
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data

Rename columns

data.columns = ['_'.join(x) for x in data.columns]
data

Subset column

data['one_a']
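Once the columns are flattened, picking every column that came from 'a' or 'c' on the second level becomes a string match on the suffix. A minimal sketch, assuming the '_' separator used above does not also occur inside the labels themselves:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# Work on a copy so the original MultiIndex frame stays untouched.
flat = data.copy()
flat.columns = ['_'.join(key) for key in flat.columns]

# Keep the columns whose name ends in _a or _c.
subset = flat.filter(regex=r'_(a|c)$')
print(subset.columns.tolist())
```

Note that the result has plain string column names, so the two-level structure is gone.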

Answer 13

One option is select_columns from pyjanitor, where you can use a dictionary to select (the dictionary option is restricted to MultiIndex only). The key of the dictionary is the level (either a number or a label), and the value is the label(s) to be selected:

# pip install pyjanitor
import pandas as pd
import janitor
data.select_columns({1:['a','c']})

        one                 two          
          a         c         a         c
0 -0.089182 -0.523464 -0.494476  0.281698
1  0.968430 -1.900191 -0.207842 -0.623020
2  0.087030 -0.093328 -0.861414 -0.021726
3 -0.952484 -1.149399  0.035582  0.922857

13 answers

Answer 1: 33, 2019-07-11 00:22:08
Answer 2: 27, accepted, 2013-08-27 16:22:58
Answer 3: 19, 2013-08-27 16:16:56
Answer 4: 17, 2015-10-11 18:19:03
Answer 5: 14, 2019-01-23 23:00:20
Answer 6: 11, 2016-06-17 03:43:55
Answer 7: 3, 2018-08-22 12:51:17
Answer 8: 1, 2022-11-14 11:22:33
Answer 9: 0, 2022-03-29 23:24:01
Answer 10: 0, 2022-06-02 23:24:17
Answer 11: 0, 2022-06-28 08:26:29
Answer 12: 0, 2022-08-22 07:19:04
Answer 13: 0, 2022-11-15 21:11:49
