
Selecting columns from pandas MultiIndex

I have a DataFrame with MultiIndex columns that looks like this:

# sample data
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data

[Image: sample data]

What is the proper, simple way of selecting only specific columns (e.g. ['a', 'c'], not a range) from the second level?

Currently I am doing it like this:

import itertools
tuples = list(itertools.product(['one', 'two'], ['a', 'c']))
new_index = pd.MultiIndex.from_tuples(tuples)
print(new_index)
# originally data.reindex_axis(new_index, axis=1); reindex_axis was removed in pandas 1.0
data.reindex(columns=new_index)

[Image: expected result]

It doesn't feel like a good solution, however, because I have to bust out itertools, build another MultiIndex by hand and then reindex (and my actual code is even messier, since the column lists aren't so simple to fetch). I am pretty sure there has to be some ix or xs way of doing this, but everything I tried resulted in errors.

The most straightforward way is with .loc:

>>> data.loc[:, (['one', 'two'], ['a', 'b'])]


   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

Remember that [] and () have special meaning when dealing with a MultiIndex object:

(...) a tuple is interpreted as one multi-level key (...)

(...) a list is used to specify several keys [on the same level] (...)

(...) a tuple of lists refer to several values within a level (...)

When we write (['one', 'two'], ['a', 'b']), the first list inside the tuple specifies all the values we want from the first level of the MultiIndex. The second list inside the tuple specifies all the values we want from the second level of the MultiIndex.
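To make the three indexing forms concrete, here is a minimal sketch on a frame shaped like the one in the question. The integer values from np.arange are an assumption, chosen only so the result is deterministic; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
# Deterministic stand-in for the random sample data (assumption).
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# A tuple is one multi-level key: exactly the column ('one', 'a').
single = data.loc[:, ('one', 'a')]

# A list of tuples names several complete keys.
pairs = data.loc[:, [('one', 'a'), ('two', 'c')]]

# A tuple of lists selects per level: first-level labels, then second-level labels.
per_level = data.loc[:, (['one', 'two'], ['a', 'c'])]

print(list(per_level.columns))
```

Note that the tuple form returns a Series, while the two list-based forms keep the MultiIndex columns.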

Edit 1: Another possibility is to use slice(None) to specify that we want everything from the first level (it works like slicing with : in lists), and then specify which columns we want from the second level.

>>> data.loc[:, (slice(None), ["a", "b"])]

   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

If the syntax slice(None) does not appeal to you, then another possibility is to use pd.IndexSlice, which helps with slicing frames that have more elaborate indices.

>>> data.loc[:, pd.IndexSlice[:, ["a", "b"]]]

   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

When using pd.IndexSlice, we can use : as usual to slice the frame.
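A short sketch of why the two spellings are interchangeable: pd.IndexSlice only builds the key, and the key it builds is the same tuple you would write by hand with slice(None). The np.arange data and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
# Deterministic stand-in for the random sample data (assumption).
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

idx = pd.IndexSlice

# IndexSlice turns the ':' syntax into plain slice objects.
key = idx[:, ['a', 'c']]
print(key)

via_indexslice = data.loc[:, key]
via_tuple = data.loc[:, (slice(None), ['a', 'c'])]

# Label ranges work too; label slices include both endpoints.
ranged = data.loc[:, idx[:, 'a':'b']]
```

Label-range slicing like 'a':'b' relies on the columns being sorted, which holds for this sample frame.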

Source: MultiIndex / Advanced Indexing, How to use slice(None)

It's not great, but maybe:

>>> data
        one                           two                    
          a         b         c         a         b         c
0 -0.927134 -1.204302  0.711426  0.854065 -0.608661  1.140052
1 -0.690745  0.517359 -0.631856  0.178464 -0.312543 -0.418541
2  1.086432  0.194193  0.808235 -0.418109  1.055057  1.886883
3 -0.373822 -0.012812  1.329105  1.774723 -2.229428 -0.617690
>>> data.loc[:,data.columns.get_level_values(1).isin({"a", "c"})]
        one                 two          
          a         c         a         c
0 -0.927134  0.711426  0.854065  1.140052
1 -0.690745 -0.631856  0.178464 -0.418541
2  1.086432  0.808235 -0.418109  1.886883
3 -0.373822  1.329105  1.774723 -0.617690

would work?

You can use either loc or ix. I'll show an example with loc:

data.loc[:, [('one', 'a'), ('one', 'c'), ('two', 'a'), ('two', 'c')]]

When you have a MultiIndexed DataFrame and you want to select only some of the columns, you have to pass a list of tuples that match those columns. So the itertools approach was pretty much OK, but you don't have to create a new MultiIndex:

data.loc[:, list(itertools.product(['one', 'two'], ['a', 'c']))]
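One practical wrinkle with the product approach is that it can generate tuples that are not actual columns. A minimal sketch, assuming you would rather skip such combinations than look them up: filter the candidates against data.columns first. The label 'three' and the np.arange data are assumptions added for illustration.

```python
import itertools

import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
# Deterministic stand-in for the random sample data (assumption).
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# 'three' is deliberately absent from the first level.
candidates = itertools.product(['one', 'two', 'three'], ['a', 'c'])

# Keep only the tuples that are real column keys.
wanted = [key for key in candidates if key in data.columns]
subset = data.loc[:, wanted]

print(wanted)
```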

I think there is a much better way (now), which is why I bothered pulling this question (which was the top Google result) out of the shadows:

data.select(lambda x: x[1] in ['a', 'b'], axis=1)

gives your expected output in a quick and clean one-liner:

        one                 two          
          a         b         a         b
0 -0.341326  0.374504  0.534559  0.429019
1  0.272518  0.116542 -0.085850 -0.330562
2  1.982431 -0.420668 -0.444052  1.049747
3  0.162984 -0.898307  1.762208 -0.101360

It is mostly self-explanatory; the [1] refers to the level.

ix and select are deprecated!

The use of pd.IndexSlice makes loc preferable to ix and select.
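For readers porting the select one-liner shown earlier, here is a hedged sketch of one way to express the same per-column predicate with loc: evaluate the predicate over the column tuples and pass the resulting boolean mask. This is an illustrative substitute, not an official migration recipe; the np.arange data is an assumption.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
# Deterministic stand-in for the random sample data (assumption).
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# Iterating MultiIndex columns yields one tuple per column.
mask = np.array([second in ('a', 'b') for _first, second in data.columns])

# A boolean mask with one entry per column selects by position.
selected = data.loc[:, mask]

print(list(selected.columns))
```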


DataFrame.loc with pd.IndexSlice

# Setup
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame('x', index=range(4), columns=col)
data

  one       two      
    a  b  c   a  b  c
0   x  x  x   x  x  x
1   x  x  x   x  x  x
2   x  x  x   x  x  x
3   x  x  x   x  x  x

data.loc[:, pd.IndexSlice[:, ['a', 'c']]]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

You can alternatively pass an axis parameter to loc to make it explicit which axis you're indexing from:

data.loc(axis=1)[pd.IndexSlice[:, ['a', 'c']]]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

MultiIndex.get_level_values

Calling data.columns.get_level_values to filter with loc is another option:

data.loc[:, data.columns.get_level_values(1).isin(['a', 'c'])]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

This naturally allows filtering on any conditional expression on a single level. Here's an arbitrary example with lexicographical filtering:

data.loc[:, data.columns.get_level_values(1) > 'b']

  one two
    c   c
0   x   x
1   x   x
2   x   x
3   x   x
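Because each get_level_values call yields a plain boolean array once compared, masks from different levels can be combined with & and negated with ~. A minimal sketch, with np.arange data and variable names assumed for illustration:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
# Deterministic stand-in for the sample data (assumption).
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

first = data.columns.get_level_values(0)
second = data.columns.get_level_values(1)

# Conditions on two levels, combined element-wise.
both = data.loc[:, (first == 'one') & second.isin(['a', 'c'])]

# Negation: everything except 'b' on the second level.
not_b = data.loc[:, ~second.isin(['b'])]

print(list(both.columns))
print(list(not_b.columns))
```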

More information on slicing and filtering MultiIndexes can be found at Select rows in pandas MultiIndex DataFrame.

To select all columns named 'a' and 'c' at the second level of your column indexer, you can use slicers:

>>> data.loc[:, (slice(None), ('a', 'c'))]

        one                 two          
          a         c         a         c
0 -0.983172 -2.495022 -0.967064  0.124740
1  0.282661 -0.729463 -0.864767  1.716009
2  0.942445  1.276769 -0.595756 -0.973924
3  2.182908 -0.267660  0.281916 -0.587835

Here you can read more about slicers.

A slightly easier (to my mind) riff on Marc P.'s answer, using slice:

import numpy as np
import pandas as pd
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'], ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

data.loc[:, pd.IndexSlice[:, ['a', 'c']]]

        one                 two          
          a         c         a         c
0 -1.731008  0.718260 -1.088025 -1.489936
1 -0.681189  1.055909  1.825839  0.149438
2 -1.674623  0.769062  1.857317  0.756074
3  0.408313  1.291998  0.833145 -0.471879

As of pandas 0.21 or so, .select is deprecated in favour of .loc.

For an arbitrary level of the column value

If the level of the column index should not have to be specified, this might help you a bit:

import itertools

import pandas as pd


class DataFrameMultiColumn(pd.DataFrame):
    def loc_multicolumn(self, keys):
        depth = lambda L: isinstance(L, list) and max(map(depth, L))+1
        
        result = []
        col = self.columns
        
        # if depth of keys is 1, all keys need to be true
        if depth(keys) == 1:
            for c in col:
                # select all columns which contain all keys
                if set(keys).issubset(set(c)) : 
                    result.append(c)
        # depth of 2 indicates, 
        # the product of all sublists will be formed
        elif depth(keys) == 2 :
            keys = list(itertools.product(*keys)) 
            for c in col:
                for k in keys :
                    # select all columns which contain all keys
                    if set(k).issubset(set(c)) : 
                        result.append(c)
                        
        else :
            raise ValueError("Depth of the keys list exceeds 2")

        # return with .loc command
        return self.loc[:,result]

.loc_multicolumn returns the same result as calling .loc, but without specifying the level for each key. Please note that this might be a problem if the same values occur in multiple column levels!

Example:

Sample data:

np.random.seed(1)
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randint(0, 10, (4,6)), columns=col)
data_mc = DataFrameMultiColumn(data)

>>> data_mc
      one       two      
        a  b  c   a  b  c
    0   5  8  9   5  0  0
    1   1  7  6   9  2  4
    2   5  2  4   2  4  7
    3   7  9  1   7  0  6

Cases:

A list of depth 1 requires every element of the list to match.

>>> data_mc.loc_multicolumn(['a', 'one'])
  one
    a
0   5
1   1
2   5
3   7
>>> data_mc.loc_multicolumn(['a', 'b'])

Empty DataFrame
Columns: []

Index: [0, 1, 2, 3]

>>> data_mc.loc_multicolumn(['one','a', 'b'])
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

A list of depth 2 allows all elements of the Cartesian product of the keys list.

>>> data_mc.loc_multicolumn([['a', 'b']])
  one    two   
    a  b   a  b
0   5  8   5  0
1   1  7   9  2
2   5  2   2  4
3   7  9   7  0
    
>>> data_mc.loc_multicolumn([['one'],['a', 'b']])
  one   
    a  b
0   5  8
1   1  7
2   5  2
3   7  9

For the last case: every combination from list(itertools.product(["one"], ['a', 'b'])) is returned, provided all elements of the combination match.
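The caveat about labels repeated across levels can be sidestepped by naming the level explicitly. The sketch below is an alternative to the subclass, not a reimplementation of it: select_by_level is a hypothetical helper, and the level names 'outer' and 'inner' as well as the clashing labels are assumptions chosen to show the ambiguity.

```python
import numpy as np
import pandas as pd


def select_by_level(df, level, labels):
    """Keep the columns whose label on `level` (name or position) is in `labels`."""
    mask = df.columns.get_level_values(level).isin(labels)
    return df.loc[:, mask]


# 'a' and 'b' appear on both levels, which is the problematic case.
col = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'],
                                 ['a', 'b', 'a', 'b']],
                                names=['outer', 'inner'])
data = pd.DataFrame(np.arange(16).reshape(4, 4), columns=col)

inner_a = select_by_level(data, 'inner', ['a'])
outer_a = select_by_level(data, 0, ['a'])

print(list(inner_a.columns))
print(list(outer_a.columns))
```

Passing the level makes the two requests distinguishable even though both ask for the label 'a'.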

Use df.loc(axis="columns") (or df.loc(axis=1)) to access just the columns and slice away:

df.loc(axis="columns")[:, ["a", "c"]]
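A small sketch confirming that the axis form is just shorthand: once the axis is fixed, the key in the brackets is applied to the column levels only, so it matches the explicit two-axis spelling. The frame and its np.arange values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
# Deterministic stand-in for the sample data (assumption).
df = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# With the axis fixed, ':' and the list address the two column levels.
by_axis = df.loc(axis="columns")[:, ["a", "c"]]

# Equivalent spelling with both axes written out.
explicit = df.loc[:, (slice(None), ["a", "c"])]

print(list(by_axis.columns))
```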

The .loc[:, list of column tuples] approach given in one of the earlier answers fails if the MultiIndex has boolean values, as in the example below:

col = pd.MultiIndex.from_arrays([[False, False, True,  True],
                                 [False, True,  False, True]])
data = pd.DataFrame(np.random.randn(4, 4), columns=col)
data.loc[:,[(False, True),(True, False)]]

This fails with a ValueError: PandasArray must be 1-dimensional.

Compare this to the following example, where the index values are strings and not booleans:

col = pd.MultiIndex.from_arrays([["False", "False", "True",  "True"],
                                 ["False", "True",  "False", "True"]])
data = pd.DataFrame(np.random.randn(4, 4), columns=col)
data.loc[:,[("False", "True"),("True", "False")]]

This works fine.

You can transform the first (boolean) scenario to the second (string) scenario with

data.columns = pd.MultiIndex.from_tuples([(str(i),str(j)) for i,j in data.columns],
    names=data.columns.names)

and then access with string instead of boolean column index values (the names=data.columns.names parameter is optional and not relevant to this example). This example has a two-level column index; if you have more levels, adjust the code accordingly.

A boolean multi-level column index arises, for example, when one builds a crosstab whose columns result from two or more comparisons.
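If converting the labels to strings is undesirable, one possible alternative is to avoid label lookup altogether and select by a positional boolean mask built from the column tuples. This is a sketch under that assumption rather than a documented workaround; the np.arange data and the names wanted and mask are illustrative.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([[False, False, True, True],
                                 [False, True, False, True]])
# Deterministic stand-in for the random data (assumption).
data = pd.DataFrame(np.arange(16).reshape(4, 4), columns=col)

# The column keys we want, as plain tuples.
wanted = {(False, True), (True, False)}

# One True/False entry per column, in column order.
mask = np.array([key in wanted for key in data.columns])

# The mask selects by position, so the boolean labels are never looked up.
subset = data.loc[:, mask]

print(list(subset.columns))
```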

There are two answers here, depending on the exact output you need.

If you want to get a single-level DataFrame from your selection (which can sometimes be really useful), simply use:

df.xs('theColumnYouNeed', level=1, axis=1)

If you want to keep the MultiIndex form (similar to metakermit's answer):

data.loc[:, data.columns.get_level_values(1) == "columnName"]
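The difference between the two results is easiest to see on the column index itself. A minimal sketch, assuming the question's column layout with np.arange values; drop_level=False is shown as a third option that keeps both levels while still using xs.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
# Deterministic stand-in for the sample data (assumption).
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# xs drops the selected level, leaving single-level columns.
flat = data.xs('a', level=1, axis=1)

# The boolean mask keeps both levels.
kept = data.loc[:, data.columns.get_level_values(1) == 'a']

# xs can also keep the level when asked to.
kept_xs = data.xs('a', level=1, axis=1, drop_level=False)

print(list(flat.columns))
print(list(kept.columns))
```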

Hope this will help someone.

Rename columns before selecting

  • Sample dataframe
import pandas as pd
import numpy as np
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data
  • Rename columns
data.columns = ['_'.join(x) for x in data.columns]
data
  • Subset column
data['one_a']
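Extending the steps above to the original question (all 'a' and 'c' columns), a minimal sketch: after flattening, select by the part of the name that came from the second level. It assumes the original labels contain no underscore, since the join is otherwise ambiguous; the np.arange data and the copy named flat are also assumptions.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
# Deterministic stand-in for the random sample data (assumption).
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# Flatten on a copy so the original MultiIndex frame is left untouched.
flat = data.copy()
flat.columns = ['_'.join(pair) for pair in flat.columns]

# The text after the last underscore is the former second-level label.
wanted = [name for name in flat.columns if name.rsplit('_', 1)[1] in ('a', 'c')]
subset = flat[wanted]

print(wanted)
```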

One option is select_columns from pyjanitor, which accepts a dictionary for selection (the dictionary option is restricted to MultiIndex only). The key of the dictionary is the level (either a number or a label), and the value is the label(s) to be selected:

# pip install pyjanitor
import pandas as pd
import janitor
data.select_columns({1:['a','c']})

        one                 two          
          a         c         a         c
0 -0.089182 -0.523464 -0.494476  0.281698
1  0.968430 -1.900191 -0.207842 -0.623020
2  0.087030 -0.093328 -0.861414 -0.021726
3 -0.952484 -1.149399  0.035582  0.922857
