
Selecting columns from pandas MultiIndex

I have a DataFrame with MultiIndex columns that looks like this:

# sample data
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data

(image: sample data)

What is the proper, simple way of selecting only specific columns (e.g. ['a', 'c'] , not a range) from the second level?

Currently I am doing it like this:

import itertools
tuples = [i for i in itertools.product(['one', 'two'], ['a', 'c'])]
new_index = pd.MultiIndex.from_tuples(tuples)
print(new_index)
data.reindex(columns=new_index)  # originally data.reindex_axis(new_index, axis=1), removed in pandas 1.0

(image: expected result)

It doesn't feel like a good solution, however, because I have to bust out itertools , build another MultiIndex by hand and then reindex (and my actual code is even messier, since the column lists aren't so simple to fetch). I am pretty sure there has to be some ix or xs way of doing this, but everything I tried resulted in errors.

The most straightforward way is with .loc :

>>> data.loc[:, (['one', 'two'], ['a', 'b'])]


   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

Remember that [] and () have special meaning when dealing with a MultiIndex object:

(...) a tuple is interpreted as one multi-level key

(...) a list is used to specify several keys [on the same level ]

(...) a tuple of lists refer to several values within a level

When we write (['one', 'two'], ['a', 'b']) , the first list inside the tuple specifies all the values we want from the 1st level of the MultiIndex . The second list inside the tuple specifies all the values we want from the 2nd level of the MultiIndex .
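As a minimal sketch of the rule above (assuming the sample frame from the question, rebuilt here so the snippet runs on its own), the tuple of lists can be checked against the column labels it returns. It uses ['a', 'c'] , the pair the question asks for:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# First list: labels wanted from level 0; second list: labels wanted from level 1.
subset = data.loc[:, (['one', 'two'], ['a', 'c'])]
print(subset.columns.tolist())
```

The selected columns keep the order they have in the original frame.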

Edit 1: Another possibility is to use slice(None) to specify that we want anything from the first level (works similarly to slicing with : in lists). And then specify which columns from the second level we want.

>>> data.loc[:, (slice(None), ["a", "b"])]

   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

If the syntax slice(None) does not appeal to you, then another possibility is to use pd.IndexSlice , which helps when slicing frames with more elaborate indices.

>>> data.loc[:, pd.IndexSlice[:, ["a", "b"]]]

   one       two     
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6

When using pd.IndexSlice , we can use : as usual to slice the frame.
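To make the relationship between the two spellings concrete, here is a small sketch (again assuming the question's sample frame) showing that pd.IndexSlice just packages the subscript into the same tuple you would otherwise write by hand with slice(None) :

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

idx = pd.IndexSlice
key = idx[:, ['a', 'c']]          # the ":" becomes slice(None)

# Both spellings select exactly the same columns.
same = data.loc[:, key].equals(data.loc[:, (slice(None), ['a', 'c'])])
print(key, same)
```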

Source: MultiIndex / Advanced Indexing , How to use slice(None)

It's not great, but maybe:

>>> data
        one                           two                    
          a         b         c         a         b         c
0 -0.927134 -1.204302  0.711426  0.854065 -0.608661  1.140052
1 -0.690745  0.517359 -0.631856  0.178464 -0.312543 -0.418541
2  1.086432  0.194193  0.808235 -0.418109  1.055057  1.886883
3 -0.373822 -0.012812  1.329105  1.774723 -2.229428 -0.617690
>>> data.loc[:,data.columns.get_level_values(1).isin({"a", "c"})]
        one                 two          
          a         c         a         c
0 -0.927134  0.711426  0.854065  1.140052
1 -0.690745 -0.631856  0.178464 -0.418541
2  1.086432  0.808235 -0.418109  1.886883
3 -0.373822  1.329105  1.774723 -0.617690

would work?

You can use either loc or ix (note that ix has since been deprecated). I'll show an example with loc :

data.loc[:, [('one', 'a'), ('one', 'c'), ('two', 'a'), ('two', 'c')]]

When you have a MultiIndexed DataFrame, and you want to filter out only some of the columns, you have to pass a list of tuples that match those columns. So the itertools approach was pretty much OK, but you don't have to create a new MultiIndex:

data.loc[:, list(itertools.product(['one', 'two'], ['a', 'c']))]
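A sketch of the same idea with one extra guard. The label 'three' below is a made-up value that is not in the frame; the point is that filtering the product against data.columns keeps only tuples that actually exist, which avoids a lookup error when the label lists are assembled programmatically:

```python
import itertools
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# 'three' is hypothetical and absent from the columns; it is filtered out below.
candidates = itertools.product(['one', 'two', 'three'], ['a', 'c'])
wanted = [label for label in candidates if label in data.columns]

subset = data.loc[:, wanted]
print(subset.columns.tolist())
```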

I think there is a much better way (now), which is why I bother pulling this question (which was the top google result) out of the shadows:

data.select(lambda x: x[1] in ['a', 'b'], axis=1)

gives your expected output in a quick and clean one-liner:

        one                 two          
          a         b         a         b
0 -0.341326  0.374504  0.534559  0.429019
1  0.272518  0.116542 -0.085850 -0.330562
2  1.982431 -0.420668 -0.444052  1.049747
3  0.162984 -0.898307  1.762208 -0.101360

It is mostly self-explanatory: the lambda receives each column label as a tuple, and the [1] picks out the second level.

ix and select are deprecated!

With pd.IndexSlice , loc is preferable to both ix and select .
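If you liked the predicate style of select , one possible replacement (a sketch, not the only way) is to evaluate the predicate over the column tuples yourself and hand the resulting boolean mask to loc :

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

keep = ['a', 'b']
# Each column label is a tuple such as ('one', 'a'); label[1] is the second level.
mask = np.array([label[1] in keep for label in data.columns])

subset = data.loc[:, mask]
print(subset.columns.tolist())
```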


DataFrame.loc with pd.IndexSlice

# Setup
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame('x', index=range(4), columns=col)
data

  one       two      
    a  b  c   a  b  c
0   x  x  x   x  x  x
1   x  x  x   x  x  x
2   x  x  x   x  x  x
3   x  x  x   x  x  x

data.loc[:, pd.IndexSlice[:, ['a', 'c']]]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

Alternatively, you can pass an axis parameter to loc to make it explicit which axis you're indexing:

data.loc(axis=1)[pd.IndexSlice[:, ['a', 'c']]]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

MultiIndex.get_level_values

Calling data.columns.get_level_values to filter with loc is another option:

data.loc[:, data.columns.get_level_values(1).isin(['a', 'c'])]

  one    two   
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x

This can naturally allow for filtering on any conditional expression on a single level. Here's a random example with lexicographical filtering:

data.loc[:, data.columns.get_level_values(1) > 'b']

  one two
    c   c
0   x   x
1   x   x
2   x   x
3   x   x
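Because each get_level_values call yields a boolean array, conditions on different levels can also be combined with & or | . A minimal sketch, reusing the setup above (the particular conditions are arbitrary and chosen only for illustration):

```python
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame('x', index=range(4), columns=col)

level0 = data.columns.get_level_values(0)
level1 = data.columns.get_level_values(1)

# Keep columns under 'two' whose second-level label sorts after 'a'.
mask = (level0 == 'two') & (level1 > 'a')
subset = data.loc[:, mask]
print(subset.columns.tolist())
```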

More information on slicing and filtering MultiIndexes can be found at Select rows in pandas MultiIndex DataFrame .

To select all columns named 'a' and 'c' at the second level of your column indexer, you can use slicers:

>>> data.loc[:, (slice(None), ('a', 'c'))]

        one                 two          
          a         c         a         c
0 -0.983172 -2.495022 -0.967064  0.124740
1  0.282661 -0.729463 -0.864767  1.716009
2  0.942445  1.276769 -0.595756 -0.973924
3  2.182908 -0.267660  0.281916 -0.587835

Here you can read more about slicers.

A slightly easier, to my mind, riff on Marc P. 's answer using slice :

import numpy as np
import pandas as pd
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'], ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

data.loc[:, pd.IndexSlice[:, ['a', 'c']]]

        one                 two          
          a         c         a         c
0 -1.731008  0.718260 -1.088025 -1.489936
1 -0.681189  1.055909  1.825839  0.149438
2 -1.674623  0.769062  1.857317  0.756074
3  0.408313  1.291998  0.833145 -0.471879

As of pandas 0.21 or so, .select is deprecated in favour of .loc .

For arbitrary level of the column value

If the level of the column index shall be arbitrary, this might help you a bit:

import itertools
import pandas as pd

class DataFrameMultiColumn(pd.DataFrame):
    def loc_multicolumn(self, keys):
        depth = lambda L: isinstance(L, list) and max(map(depth, L))+1
        
        result = []
        col = self.columns
        
        # if depth of keys is 1, all keys need to be true
        if depth(keys) == 1:
            for c in col:
                # select all columns which contain all keys
                if set(keys).issubset(set(c)) : 
                    result.append(c)
        # depth of 2 indicates, 
        # the product of all sublists will be formed
        elif depth(keys) == 2 :
            keys = list(itertools.product(*keys)) 
            for c in col:
                for k in keys :
                    # select all columns which contain all keys
                    if set(k).issubset(set(c)) : 
                        result.append(c)
                        
        else :
            raise ValueError("Depth of the keys list exceeds 2")

        # return with .loc command
        return self.loc[:,result]

.loc_multicolumn will return the same as calling .loc but without specifying the level for each key. Please note that this might be a problem if the same value appears in more than one column level!

Example:

Sample data:

np.random.seed(1)
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randint(0, 10, (4,6)), columns=col)
data_mc = DataFrameMultiColumn(data)

>>> data_mc
      one       two      
        a  b  c   a  b  c
    0   5  8  9   5  0  0
    1   1  7  6   9  2  4
    2   5  2  4   2  4  7
    3   7  9  1   7  0  6

Cases:

A list of depth 1 requires every element of the list to be present in the column label.

>>> data_mc.loc_multicolumn(['a', 'one'])
  one
    a
0   5
1   1
2   5
3   7
>>> data_mc.loc_multicolumn(['a', 'b'])
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

>>> data_mc.loc_multicolumn(['one','a', 'b'])
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

List depth 2 allows all elements of the Cartesian product of keys list.

>>> data_mc.loc_multicolumn([['a', 'b']])
  one    two   
    a  b   a  b
0   5  8   5  0
1   1  7   9  2
2   5  2   2  4
3   7  9   7  0
    
>>> data_mc.loc_multicolumn([['one'],['a', 'b']])
  one   
    a  b
0   5  8
1   1  7
2   5  2
3   7  9

For the last case: every combination from list(itertools.product(["one"], ['a', 'b'])) is returned, provided all elements of that combination appear in the column label.
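If the ambiguity mentioned above (the same value occurring in several levels) is a concern, a level-aware variant is one way around it. The helper below is a hypothetical sketch, not part of pandas or of the class above: select_by_level and its selection argument are names invented here, mapping a level number to the labels allowed at that level.

```python
import numpy as np
import pandas as pd

def select_by_level(df, selection):
    """Keep columns whose label at every listed level is in the allowed labels.

    selection maps a column level (position or name) to a list of labels.
    This is an illustrative helper, not a pandas API.
    """
    mask = np.ones(len(df.columns), dtype=bool)
    for level, labels in selection.items():
        mask &= df.columns.get_level_values(level).isin(labels)
    return df.loc[:, mask]

np.random.seed(1)
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randint(0, 10, (4, 6)), columns=col)

second_level_only = select_by_level(data, {1: ['a', 'b']})
both_levels = select_by_level(data, {0: ['one'], 1: ['a', 'b']})
print(second_level_only.columns.tolist())
print(both_levels.columns.tolist())
```

Since each condition is tied to an explicit level, a label such as 'a' is never matched against the wrong level.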

Use df.loc(axis="columns") (or df.loc(axis=1) ) to address only the columns, then slice:

df.loc(axis="columns")[:, ["a", "c"]]
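A short sketch, assuming the question's sample frame stands in for df , which checks that this spelling selects the same columns as the pd.IndexSlice form used in other answers:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
df = pd.DataFrame(np.random.randn(4, 6), columns=col)

# With the axis fixed, the subscript describes the column levels only.
by_axis = df.loc(axis="columns")[:, ["a", "c"]]
by_slice = df.loc[:, pd.IndexSlice[:, ["a", "c"]]]
print(by_axis.equals(by_slice))
```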

The .loc[:, list of column tuples] approach given in one of the earlier answers fails in case the multi-index has boolean values, as in the example below:

col = pd.MultiIndex.from_arrays([[False, False, True,  True],
                                 [False, True,  False, True]])
data = pd.DataFrame(np.random.randn(4, 4), columns=col)
data.loc[:,[(False, True),(True, False)]]

This fails with a ValueError: PandasArray must be 1-dimensional.

Compare this to the following example, where the index values are strings and not boolean:

col = pd.MultiIndex.from_arrays([["False", "False", "True",  "True"],
                                 ["False", "True",  "False", "True"]])
data = pd.DataFrame(np.random.randn(4, 4), columns=col)
data.loc[:,[("False", "True"),("True", "False")]]

This works fine.

You can transform the first (boolean) scenario to the second (string) scenario with

data.columns = pd.MultiIndex.from_tuples([(str(i),str(j)) for i,j in data.columns],
    names=data.columns.names)

and then access the columns with string values instead of boolean ones (the names=data.columns.names parameter is optional and not relevant to this example). This example has a two-level column index; if you have more levels, adjust the code accordingly.
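One way to avoid adjusting the code per level count is to convert each label tuple generically. This is a sketch of that variation, using the same boolean example:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([[False, False, True, True],
                                 [False, True, False, True]])
data = pd.DataFrame(np.random.randn(4, 4), columns=col)

# Convert every element of every label to str, whatever the number of levels.
data.columns = pd.MultiIndex.from_tuples(
    [tuple(str(value) for value in label) for label in data.columns],
    names=data.columns.names)

subset = data.loc[:, [("False", "True"), ("True", "False")]]
print(subset.columns.tolist())
```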

Getting a boolean multi-level column index arises, for example, if one does a crosstab where the columns result from two or more comparisons.

There are two answers here, depending on the exact output you need.

If you want to get a single-level DataFrame from your selection (which can sometimes be really useful), simply use:

df.xs('theColumnYouNeed', level=1, axis=1)

If you want to keep the multiindex form (similar to metakermit's answer) :

data.loc[:, data.columns.get_level_values(1) == "columnName"]
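A small sketch contrasting the two on the question's sample frame, with 'a' standing in for the column name: xs drops the selected level, while the boolean mask keeps both levels.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

flat = data.xs('a', level=1, axis=1)                          # level 1 is dropped
kept = data.loc[:, data.columns.get_level_values(1) == 'a']   # MultiIndex is kept

print(flat.columns.tolist())
print(kept.columns.tolist())
```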

Hope this will help someone

Rename columns before selecting

  • Sample dataframe
import pandas as pd
import numpy as np
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data
  • rename columns
data.columns = ['_'.join(x) for x in data.columns]
data
  • Subset column
data['one_a']
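To get back to the original question (every 'a' and 'c' column) after flattening, the joined names can be filtered by suffix. A sketch under the assumption that the second-level labels do not themselves contain the '_' separator:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

flat = data.copy()  # leave the MultiIndex frame untouched
flat.columns = ['_'.join(label) for label in flat.columns]

# Keep flattened names ending in the wanted second-level labels.
wanted = [name for name in flat.columns if name.endswith(('_a', '_c'))]
subset = flat[wanted]
print(wanted)
```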

One option is select_columns from pyjanitor , which accepts a dictionary for selection. The dictionary option is restricted to MultiIndex columns: each key is a level (either a number or a label), and each value is the label(s) to select at that level:

# pip install pyjanitor
import pandas as pd
import janitor
data.select_columns({1:['a','c']})

        one                 two          
          a         c         a         c
0 -0.089182 -0.523464 -0.494476  0.281698
1  0.968430 -1.900191 -0.207842 -0.623020
2  0.087030 -0.093328 -0.861414 -0.021726
3 -0.952484 -1.149399  0.035582  0.922857
