Selecting columns from pandas MultiIndex
I have a DataFrame with MultiIndex columns that looks like this:
import numpy as np
import pandas as pd
# sample data
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data
What is the proper, simple way of selecting only specific columns (e.g. ['a', 'c'], not a range) from the second level?
Currently I am doing it like this:
import itertools
tuples = [i for i in itertools.product(['one', 'two'], ['a', 'c'])]
new_index = pd.MultiIndex.from_tuples(tuples)
print(new_index)
data.reindex_axis(new_index, axis=1)
It doesn't feel like a good solution, however, because I have to bust out itertools, build another MultiIndex by hand and then reindex (and my actual code is even messier, since the column lists aren't so simple to fetch). I am pretty sure there has to be some ix or xs way of doing this, but everything I tried resulted in errors.
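A note for readers on a recent pandas: as far as I know, reindex_axis was deprecated and later removed, so the snippet in the question may raise AttributeError there. A minimal sketch of the same idea using reindex(columns=...), assuming the sample frame from the question; MultiIndex.from_product is my substitution for the hand-built tuples, and the name subset is only illustrative:

```python
import numpy as np
import pandas as pd

# Sample frame from the question
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# Build the wanted column labels as a Cartesian product of the two levels
new_index = pd.MultiIndex.from_product([['one', 'two'], ['a', 'c']])

# reindex(columns=...) plays the role of reindex_axis(..., axis=1)
subset = data.reindex(columns=new_index)
print(subset)
```

This still builds a second MultiIndex, so it only modernises the original attempt rather than simplifying it.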
The most straightforward way is with .loc:
>>> data.loc[:, (['one', 'two'], ['a', 'b'])]
   one       two
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6
Remember that [] and () have special meaning when dealing with a MultiIndex object:
(...) a tuple is interpreted as one multi-level key
(...) a list is used to specify several keys [on the same level]
(...) a tuple of lists refer to several values within a level
When we write (['one', 'two'], ['a', 'b']), the first list inside the tuple specifies all the values we want from the 1st level of the MultiIndex. The second list inside the tuple specifies all the values we want from the 2nd level of the MultiIndex.
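To make the three quoted rules concrete, here is a small sketch that applies each key shape to the sample frame from the question. The variable names (one_key, several, cross) are only illustrative:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# A tuple is one multi-level key: a single column, returned as a Series
one_key = data.loc[:, ('one', 'a')]

# A list of tuples specifies several complete keys
several = data.loc[:, [('one', 'a'), ('two', 'c')]]

# A tuple of lists picks values per level and combines them
cross = data.loc[:, (['one', 'two'], ['a', 'c'])]

print(type(one_key).__name__, list(several.columns), list(cross.columns))
```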
Edit 1: Another possibility is to use slice(None) to specify that we want anything from the first level (works similarly to slicing with : in lists). And then specify which columns from the second level we want.
>>> data.loc[:, (slice(None), ["a", "b"])]
   one       two
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6
If the syntax slice(None) does appeal to you, then another possibility is to use pd.IndexSlice, which helps slicing frames with more elaborate indices.
>>> data.loc[:, pd.IndexSlice[:, ["a", "b"]]]
   one       two
     a    b    a    b
0  0.4 -0.6 -0.7  0.9
1  0.1  0.4  0.5 -0.3
2  0.7 -1.6  0.7 -0.8
3 -0.9  2.6  1.9  0.6
When using pd.IndexSlice, we can use : as usual to slice the frame.
Source: MultiIndex / Advanced Indexing, How to use slice(None)
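As a quick sanity check, the three spellings in this answer should select the same columns. A sketch on the question's sample frame, here with ['a', 'c'] (the columns the question actually asks for); the names by_lists, by_slice and by_idx are only illustrative:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

by_lists = data.loc[:, (['one', 'two'], ['a', 'c'])]   # tuple of lists
by_slice = data.loc[:, (slice(None), ['a', 'c'])]      # slice(None) for level 0
by_idx = data.loc[:, pd.IndexSlice[:, ['a', 'c']]]     # IndexSlice shorthand

print(by_lists.equals(by_slice), by_slice.equals(by_idx))
```

The first form needs every first-level value spelled out, so the other two are usually more convenient when the first level is large.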
It's not great, but maybe:
>>> data
        one                           two
          a         b         c         a         b         c
0 -0.927134 -1.204302  0.711426  0.854065 -0.608661  1.140052
1 -0.690745  0.517359 -0.631856  0.178464 -0.312543 -0.418541
2  1.086432  0.194193  0.808235 -0.418109  1.055057  1.886883
3 -0.373822 -0.012812  1.329105  1.774723 -2.229428 -0.617690
>>> data.loc[:,data.columns.get_level_values(1).isin({"a", "c"})]
        one                 two
          a         c         a         c
0 -0.927134  0.711426  0.854065  1.140052
1 -0.690745 -0.631856  0.178464 -0.418541
2  1.086432  0.808235 -0.418109  1.886883
3 -0.373822  1.329105  1.774723 -0.617690
would work?
You can use either loc or ix. I'll show an example with loc:
data.loc[:, [('one', 'a'), ('one', 'c'), ('two', 'a'), ('two', 'c')]]
When you have a MultiIndexed DataFrame, and you want to filter out only some of the columns, you have to pass a list of tuples that match those columns. So the itertools approach was pretty much OK, but you don't have to create a new MultiIndex:
data.loc[:, list(itertools.product(['one', 'two'], ['a', 'c']))]
I think there is a much better way (now), which is why I bother pulling this question (which was the top google result) out of the shadows:
data.select(lambda x: x[1] in ['a', 'b'], axis=1)
gives your expected output in a quick and clean one-liner:
        one                 two
          a         b         a         b
0 -0.341326  0.374504  0.534559  0.429019
1  0.272518  0.116542 -0.085850 -0.330562
2  1.982431 -0.420668 -0.444052  1.049747
3  0.162984 -0.898307  1.762208 -0.101360
It is mostly self-explanatory; the [1] refers to the level.
ix and select are deprecated (both were later removed in pandas 1.0)! The use of pd.IndexSlice makes loc a preferable option to ix and select.
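If you liked the callable style of select, one way to keep that logic with loc is to evaluate the predicate over the column tuples yourself and pass the resulting boolean list. This is only a sketch; the names wanted and mask are mine, not part of pandas:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

wanted = {'a', 'b'}
# Iterating over MultiIndex columns yields one tuple per column;
# label[1] is the second-level value, as in the select lambda.
mask = [label[1] in wanted for label in data.columns]
subset = data.loc[:, mask]
print(subset)
```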
DataFrame.loc with pd.IndexSlice
# Setup
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame('x', index=range(4), columns=col)
data
  one       two
    a  b  c   a  b  c
0   x  x  x   x  x  x
1   x  x  x   x  x  x
2   x  x  x   x  x  x
3   x  x  x   x  x  x
data.loc[:, pd.IndexSlice[:, ['a', 'c']]]
  one    two
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x
You can alternatively pass an axis parameter to loc to make it explicit which axis you're indexing from:
data.loc(axis=1)[pd.IndexSlice[:, ['a', 'c']]]
  one    two
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x
MultiIndex.get_level_values
Calling data.columns.get_level_values to filter with loc is another option:
data.loc[:, data.columns.get_level_values(1).isin(['a', 'c'])]
  one    two
    a  c   a  c
0   x  x   x  x
1   x  x   x  x
2   x  x   x  x
3   x  x   x  x
This can naturally allow for filtering on any conditional expression on a single level. Here's a random example with lexicographical filtering:
data.loc[:, data.columns.get_level_values(1) > 'b']
  one two
    c   c
0   x   x
1   x   x
2   x   x
3   x   x
More information on slicing and filtering MultiIndexes can be found at Select rows in pandas MultiIndex DataFrame.
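Since get_level_values yields an ordinary boolean mask per level, masks from different levels can also be combined with & or |. A short sketch, assuming the same 'x'-filled frame as in the setup; level0, level1 and mask are illustrative names:

```python
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame('x', index=range(4), columns=col)

level0 = data.columns.get_level_values(0)
level1 = data.columns.get_level_values(1)

# Keep only columns under 'two' whose second-level label is 'a' or 'c'
mask = (level0 == 'two') & level1.isin(['a', 'c'])
subset = data.loc[:, mask]
print(subset)
```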
To select all columns named 'a' and 'c' at the second level of your column indexer, you can use slicers:
>>> data.loc[:, (slice(None), ('a', 'c'))]
        one                 two
          a         c         a         c
0 -0.983172 -2.495022 -0.967064  0.124740
1  0.282661 -0.729463 -0.864767  1.716009
2  0.942445  1.276769 -0.595756 -0.973924
3  2.182908 -0.267660  0.281916 -0.587835
A slightly easier, to my mind, riff on Marc P.'s answer using slice:
import numpy as np
import pandas as pd
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'], ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data.loc[:, pd.IndexSlice[:, ['a', 'c']]]
        one                 two
          a         c         a         c
0 -1.731008  0.718260 -1.088025 -1.489936
1 -0.681189  1.055909  1.825839  0.149438
2 -1.674623  0.769062  1.857317  0.756074
3  0.408313  1.291998  0.833145 -0.471879
As of pandas 0.21 or so, .select is deprecated in favour of .loc.
If the level of the column index shall be arbitrary, this might help you a bit:
import itertools
import pandas as pd
class DataFrameMultiColumn(pd.DataFrame):
    def loc_multicolumn(self, keys):
        # nesting depth of a list: 1 for a flat list, 2 for a list of lists
        depth = lambda L: isinstance(L, list) and max(map(depth, L)) + 1
        result = []
        col = self.columns
        # if depth of keys is 1, all keys need to be true
        if depth(keys) == 1:
            for c in col:
                # select all columns which contain all keys
                if set(keys).issubset(set(c)):
                    result.append(c)
        # depth of 2 indicates,
        # the product of all sublists will be formed
        elif depth(keys) == 2:
            keys = list(itertools.product(*keys))
            for c in col:
                for k in keys:
                    # select all columns which contain all keys
                    if set(k).issubset(set(c)):
                        result.append(c)
        else:
            raise ValueError("Depth of the keys list exceeds 2")
        # return with .loc command
        return self.loc[:, result]
.loc_multicolumn will return the same as calling .loc but without specifying the level for each key. Please note that this might be a problem if values are the same in multiple column levels!
np.random.seed(1)
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randint(0, 10, (4,6)), columns=col)
data_mc = DataFrameMultiColumn(data)
>>> data_mc
  one       two
    a  b  c   a  b  c
0   5  8  9   5  0  0
1   1  7  6   9  2  4
2   5  2  4   2  4  7
3   7  9  1   7  0  6
A list of depth 1 selects only the columns whose labels contain every element of the list.
>>> data_mc.loc_multicolumn(['a', 'one'])
  one
    a
0   5
1   1
2   5
3   7
>>> data_mc.loc_multicolumn(['a', 'b'])
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
>>> data_mc.loc_multicolumn(['one','a', 'b'])
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
A list of depth 2 selects every column that matches an element of the Cartesian product of the key lists.
>>> data_mc.loc_multicolumn([['a', 'b']])
  one    two
    a  b   a  b
0   5  8   5  0
1   1  7   9  2
2   5  2   2  4
3   7  9   7  0
>>> data_mc.loc_multicolumn([['one'],['a', 'b']])
  one
    a  b
0   5  8
1   1  7
2   5  2
3   7  9
For the last example: a column is selected for each combination from list(itertools.product(["one"], ['a', 'b'])) whose elements all appear in that column's label.
Use df.loc(axis="columns") (or df.loc(axis=1)) to access only the columns and slice away:
df.loc(axis="columns")[:, ["a", "c"]]
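A runnable version of that one-liner, as a sketch on the question's sample frame (here named df to match the snippet above). With the axis fixed, the tuple inside the brackets is read as one key per column level:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
df = pd.DataFrame(np.random.randn(4, 6), columns=col)

# ":" keeps every first-level label, the list picks second-level labels
subset = df.loc(axis="columns")[:, ["a", "c"]]
print(subset)
```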
The .loc[:, list of column tuples] approach given in one of the earlier answers fails in case the multi-index has boolean values, as in the example below:
col = pd.MultiIndex.from_arrays([[False, False, True, True],
                                 [False, True, False, True]])
data = pd.DataFrame(np.random.randn(4, 4), columns=col)
data.loc[:,[(False, True),(True, False)]]
This fails with a ValueError: PandasArray must be 1-dimensional.
Compare this to the following example, where the index values are strings and not boolean:
col = pd.MultiIndex.from_arrays([["False", "False", "True", "True"],
["False", "True", "False", "True"]])
data = pd.DataFrame(np.random.randn(4, 4), columns=col)
data.loc[:,[("False", "True"),("True", "False")]]
This works fine.
You can transform the first (boolean) scenario to the second (string) scenario with
data.columns = pd.MultiIndex.from_tuples([(str(i),str(j)) for i,j in data.columns],
                                         names=data.columns.names)
and then access with string instead of boolean column index values (the names=data.columns.names parameter is optional and not relevant to this example). This example has a two-level column index; if you have more levels, adjust this code correspondingly.
A boolean multi-level column index arises, for example, if one does a crosstab where the columns result from two or more comparisons.
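To illustrate how such an index can come about, here is a small sketch with made-up data (the frame df and its columns g, x, y are hypothetical). It builds a crosstab from two comparisons, applies the string conversion described above, and then selects by tuples of strings:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({'g': ['p', 'q', 'p', 'q'],
                   'x': [1, 5, 3, 8],
                   'y': [2, 7, 9, 1]})

# Two comparisons as column keys give a two-level boolean column index
ct = pd.crosstab(df['g'], [df['x'] > 2, df['y'] > 4])

# Convert the boolean labels to strings, keeping the level names
ct.columns = pd.MultiIndex.from_tuples(
    [(str(i), str(j)) for i, j in ct.columns],
    names=ct.columns.names)

subset = ct.loc[:, [("True", "False"), ("True", "True")]]
print(subset)
```

Whether the boolean version raises may depend on the pandas version; the string conversion sidesteps the question either way.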
There are two answers here, depending on the exact output that you need.
If you want to get a single-level dataframe from your selection (which can sometimes be really useful), simply use:
df.xs('theColumnYouNeed', level=1, axis=1)
If you want to keep the multiindex form (similar to metakermit's answer):
data.loc[:, data.columns.get_level_values(1) == "columnName"]
Hope this will help someone
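The difference between the two forms is easiest to see on the column index of each result. A sketch on the question's sample frame; flat and kept are illustrative names:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# xs drops the selected level by default: columns become 'one', 'two'
flat = data.xs('a', level=1, axis=1)

# A boolean mask keeps both levels: ('one', 'a'), ('two', 'a')
kept = data.loc[:, data.columns.get_level_values(1) == 'a']

print(list(flat.columns), list(kept.columns))
```

Note that xs takes a single label per level, so it does not cover the ['a', 'c'] case from the question on its own.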
import pandas as pd
import numpy as np
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data
data.columns = ['_'.join(x) for x in data.columns]
data
data['one_a']
One option is with select_columns from pyjanitor, where you can use a dictionary to select (the dictionary option is restricted to MultiIndex only). The key of the dictionary is the level (either a number or label), and the value is the label(s) to be selected:
# pip install pyjanitor
import pandas as pd
import janitor
data.select_columns({1:['a','c']})
        one                 two
          a         c         a         c
0 -0.089182 -0.523464 -0.494476  0.281698
1  0.968430 -1.900191 -0.207842 -0.623020
2  0.087030 -0.093328 -0.861414 -0.021726
3 -0.952484 -1.149399  0.035582  0.922857