Using boolean indexing for row and column MultiIndex in Pandas

Question

Questions are at the end, in bold. But first, let's set up some data:

import numpy as np
import pandas as pd
from itertools import product

np.random.seed(1)

team_names = ['Yankees', 'Mets', 'Dodgers']
jersey_numbers = [35, 71, 84]
game_numbers = [1, 2]
observer_names = ['Bill', 'John', 'Ralph']
observation_types = ['Speed', 'Strength']

row_indices = list(product(team_names, jersey_numbers, game_numbers, observer_names, observation_types))
observation_values = np.random.randn(len(row_indices))

tns, jns, gns, ons, ots = zip(*row_indices)

data = pd.DataFrame({'team': tns, 'jersey': jns, 'game': gns, 'observer': ons, 'obstype': ots, 'value': observation_values})

data = data.set_index(['team', 'jersey', 'game', 'observer', 'obstype'])
data = data.unstack(['observer', 'obstype'])
data.columns = data.columns.droplevel(0)

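This gives a DataFrame with a three-level row index (team, jersey, game) and a two-level column index (observer, obstype).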

I want to pluck out a subset of this DataFrame for subsequent analysis. Say I wanted to slice out the rows where the jersey number is 71. I don't really like the idea of using xs to do this. When you do a cross section via xs you lose the index level you selected on. If I run:

data.xs(71, axis=0, level='jersey')

then I get back the right rows, but I lose the jersey level.
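For what it's worth, xs accepts a drop_level argument, and passing drop_level=False keeps the jersey level in the result. A minimal sketch, assuming a stand-in frame with the same index layout as the data above (built here with MultiIndex.from_product and fresh random values, so the numbers differ from the question's):

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame (assumption: same level names, different values).
rows = pd.MultiIndex.from_product(
    [['Dodgers', 'Mets', 'Yankees'], [35, 71, 84], [1, 2]],
    names=['team', 'jersey', 'game'])
cols = pd.MultiIndex.from_product(
    [['Bill', 'John', 'Ralph'], ['Speed', 'Strength']],
    names=['observer', 'obstype'])
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.standard_normal((len(rows), len(cols))),
                    index=rows, columns=cols)

# Default behaviour: the 'jersey' level disappears from the row index.
dropped = data.xs(71, axis=0, level='jersey')

# drop_level=False keeps all three row levels.
kept = data.xs(71, axis=0, level='jersey', drop_level=False)

print(dropped.index.names)
print(kept.index.names)
```

This still takes only one key per level, so it does not help when several jersey numbers are wanted at once.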

Also, xs doesn't seem like a great solution for the case where I want a few different values from the jersey level. I think a much nicer solution is the one found here:

data[[j in [71, 84] for t, j, g in data.index]]

You could even filter on a combination of jerseys and teams:

data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]

Nice!

So the question: how can I do something similar for selecting a subset of columns? For example, say I want only the columns representing data from Ralph. How can I do that without using xs? Or what if I wanted only the columns with observer in ['John', 'Ralph']? Again, I'd really prefer a solution that keeps all the levels of the row and column indices in the result, just like the boolean indexing examples above.

I can do what I want, and even combine selections from both the row and column indices. But the only solution I've found involves some real gymnastics:

data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]\
    .T[[obs in ['John', 'Ralph'] for obs, obstype in data.columns]].T

And thus the second question: is there a more compact way to do what I just did above?

Answer 1

As of Pandas 0.18 (possibly earlier) you can easily slice multi-indexed DataFrames using pd.IndexSlice.

For your specific question, you can use the following to select by team, jersey, and game:

data.loc[pd.IndexSlice[:,[71, 84],:],:] #IndexSlice on the rows

IndexSlice needs just enough level information to be unambiguous, so you can drop the trailing colon:

data.loc[pd.IndexSlice[:,[71, 84]],:]

Likewise, you can use IndexSlice on the columns:

data.loc[pd.IndexSlice[:,[71, 84]],pd.IndexSlice[['John', 'Ralph']]]

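This gives you the jersey and observer selection from the final DataFrame in your question. To also restrict the teams as the question does, replace the leading colon in the row slice with ['Dodgers', 'Mets'].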
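Putting the row and column slices together with the team filter might look like the sketch below. It assumes a stand-in frame with the same level names as the question's data (random values, so the numbers differ) and relies on both indices being sorted, which label-based MultiIndex slicing generally expects:

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame (assumption: same level names, different values).
rows = pd.MultiIndex.from_product(
    [['Dodgers', 'Mets', 'Yankees'], [35, 71, 84], [1, 2]],
    names=['team', 'jersey', 'game'])
cols = pd.MultiIndex.from_product(
    [['Bill', 'John', 'Ralph'], ['Speed', 'Strength']],
    names=['observer', 'obstype'])
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.standard_normal((len(rows), len(cols))),
                    index=rows, columns=cols)

idx = pd.IndexSlice

# Rows: two teams, two jerseys, every game. Columns: two observers, every obstype.
result = data.loc[idx[['Dodgers', 'Mets'], [71, 84], :],
                  idx[['John', 'Ralph'], :]]

print(result.shape)  # 8 rows (2 teams x 2 jerseys x 2 games), 4 columns
```

All levels of both indices survive in the result, which is what the question asks for.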

Answer 2

Here is one approach that uses slightly more built-in-feeling syntax. But it's still clunky as hell:

data.loc[
    (data.index.get_level_values('jersey').isin([71, 84])
     & data.index.get_level_values('team').isin(['Dodgers', 'Mets'])), 
    data.columns.get_level_values('observer').isin(['John', 'Ralph'])
]

So comparing:

def hackedsyntax():
    return data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]\
    .T[[obs in ['John', 'Ralph'] for obs, obstype in data.columns]].T

def uglybuiltinsyntax():
    return data.loc[
        (data.index.get_level_values('jersey').isin([71, 84])
         & data.index.get_level_values('team').isin(['Dodgers', 'Mets'])), 
        data.columns.get_level_values('observer').isin(['John', 'Ralph'])
    ]

%timeit hackedsyntax()
%timeit uglybuiltinsyntax()

hackedsyntax() - uglybuiltinsyntax()

Results:

1000 loops, best of 3: 395 µs per loop
1000 loops, best of 3: 409 µs per loop

Still hopeful there's a cleaner or more canonical way to do this.
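One way to hide the clunkiness is to wrap the get_level_values/isin pattern in a small helper. The sketch below is only an illustration: filter_levels is a hypothetical name, not a pandas function, and the frame is a stand-in with the same level names as the question's data.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame (assumption: same level names, different values).
rows = pd.MultiIndex.from_product(
    [['Dodgers', 'Mets', 'Yankees'], [35, 71, 84], [1, 2]],
    names=['team', 'jersey', 'game'])
cols = pd.MultiIndex.from_product(
    [['Bill', 'John', 'Ralph'], ['Speed', 'Strength']],
    names=['observer', 'obstype'])
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.standard_normal((len(rows), len(cols))),
                    index=rows, columns=cols)


def filter_levels(df, rows=None, cols=None):
    """Hypothetical helper: keep entries whose named levels take one of the given values.

    rows and cols map a level name to a list of allowed values.
    """
    row_mask = np.ones(len(df.index), dtype=bool)
    for level, values in (rows or {}).items():
        row_mask &= df.index.get_level_values(level).isin(values)

    col_mask = np.ones(len(df.columns), dtype=bool)
    for level, values in (cols or {}).items():
        col_mask &= df.columns.get_level_values(level).isin(values)

    # Boolean masks on both axes keep every index level intact.
    return df.loc[row_mask, col_mask]


result = filter_levels(
    data,
    rows={'team': ['Dodgers', 'Mets'], 'jersey': [71, 84]},
    cols={'observer': ['John', 'Ralph']},
)
print(result.shape)  # 8 rows, 4 columns
```

Because the masks are built from level names, the caller does not need to know the position of each level in the index.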

Answer 3

Note: Since Pandas v0.20, the ix accessor has been deprecated, and it was removed in v1.0; use loc or iloc instead as appropriate.

If I've understood the question correctly, it's pretty simple:

To get the columns for Ralph:

data.ix[:,"Ralph"]

To get them for two observers, pass in a list:

data.ix[:,["Ralph","John"]]

The ix operator is the power indexing operator. Remember that the first argument is rows, and then columns (as opposed to data[..][..], which is the other way around). The colon acts as a wildcard, so it returns all the rows in axis=0.

In general, to do a lookup in a MultiIndex, you should pass in a tuple, e.g.:

data.ix[:,("Ralph","Speed")]

But if you just pass in a single element, it will treat this as if you're passing in the first element of the tuple and then a wildcard.
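Since ix is no longer available in current pandas, the same lookups can be sketched with loc. This assumes a stand-in frame with the same level names as the question's data (random values, so the numbers differ):

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame (assumption: same level names, different values).
rows = pd.MultiIndex.from_product(
    [['Dodgers', 'Mets', 'Yankees'], [35, 71, 84], [1, 2]],
    names=['team', 'jersey', 'game'])
cols = pd.MultiIndex.from_product(
    [['Bill', 'John', 'Ralph'], ['Speed', 'Strength']],
    names=['observer', 'obstype'])
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.standard_normal((len(rows), len(cols))),
                    index=rows, columns=cols)

# A single level-0 key: the observer level is dropped from the result's columns.
ralph = data.loc[:, 'Ralph']

# A list of level-0 keys: both column levels are kept.
both = data.loc[:, ['Ralph', 'John']]

# A full tuple: exactly one column, returned as a Series.
ralph_speed = data.loc[:, ('Ralph', 'Speed')]

print(ralph.columns.tolist())
print(both.columns.nlevels)
print(type(ralph_speed).__name__)
```

Note that the scalar form loses the observer level, much like xs does, so the list form is the one that matches the question's requirement of keeping every level.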

Where it gets tricky is if you want to access columns that are not level 0 indices. For example, get all the columns for "Speed". Then you'd need to get a bit more creative. Use the get_level_values method of the index/columns in combination with boolean indexing:

For example, this gets jersey 71 in the rows, and Strength in the columns:

data.ix[data.index.get_level_values("jersey") == 71 , \
        data.columns.get_level_values("obstype") == "Strength"]
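The same selection written with loc, next to an IndexSlice spelling that addresses the inner column level directly, might look like this. It is a sketch on a stand-in frame with the same level names as the question's data (random values, sorted indices):

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame (assumption: same level names, different values).
rows = pd.MultiIndex.from_product(
    [['Dodgers', 'Mets', 'Yankees'], [35, 71, 84], [1, 2]],
    names=['team', 'jersey', 'game'])
cols = pd.MultiIndex.from_product(
    [['Bill', 'John', 'Ralph'], ['Speed', 'Strength']],
    names=['observer', 'obstype'])
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.standard_normal((len(rows), len(cols))),
                    index=rows, columns=cols)

# Boolean masks built from named levels, passed to loc instead of ix.
by_mask = data.loc[data.index.get_level_values('jersey') == 71,
                   data.columns.get_level_values('obstype') == 'Strength']

# Equivalent selection with IndexSlice; lists keep every level in the result.
idx = pd.IndexSlice
by_slice = data.loc[idx[:, [71], :], idx[:, ['Strength']]]

print(by_mask.shape)  # 6 rows (3 teams x 2 games), 3 columns (one per observer)
```

Both forms keep all row and column levels, so downstream code can keep referring to jersey and obstype by name.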

Answer 4

Note that from what I understand, select is slow. But another approach here would be:

data.select(lambda col: col[0] in ['John', 'Ralph'], axis=1)

You can also chain this with a selection against the rows:

data.select(lambda col: col[0] in ['John', 'Ralph'], axis=1) \
    .select(lambda row: row[1] in [71, 84] and row[2] > 1, axis=0)

The big drawback here is that you have to know the index level number.
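As far as I know, DataFrame.select was deprecated in later pandas releases and is not available in current ones, so treat the calls above as historical. A rough equivalent of the chained selection, sketched with plain boolean lists and loc on a stand-in frame with the same level names as the question's data:

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame (assumption: same level names, different values).
rows = pd.MultiIndex.from_product(
    [['Dodgers', 'Mets', 'Yankees'], [35, 71, 84], [1, 2]],
    names=['team', 'jersey', 'game'])
cols = pd.MultiIndex.from_product(
    [['Bill', 'John', 'Ralph'], ['Speed', 'Strength']],
    names=['observer', 'obstype'])
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.standard_normal((len(rows), len(cols))),
                    index=rows, columns=cols)

# Each index entry is a tuple, so the same positional predicates still apply.
col_mask = [col[0] in ['John', 'Ralph'] for col in data.columns]
row_mask = [row[1] in [71, 84] and row[2] > 1 for row in data.index]

result = data.loc[row_mask, col_mask]
print(result.shape)  # 6 rows (3 teams x 2 jerseys x game 2), 4 columns
```

This keeps the positional drawback mentioned above; swapping the tuple positions for get_level_values with a level name avoids it.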

4 answers

Answer 1: 2 votes, 2018-02-07 18:58:11
Answer 2: 1 vote, accepted, 2013-12-24 04:58:14
Answer 3: 1 vote, 2013-12-24 22:03:44
Answer 4: 0 votes, 2014-01-02 22:49:27
