How to slice into a MultiIndex Pandas DataFrame?
Suppose you have the following data frame:
In [1]: import pandas as pd
In [2]: index = [('California',2000),('California', 2010), ('New York', 2000),
('New York', 2000), ('New York', 2010), ('Texas', 2000), ('Texas',2010)]
In [3]: populations = [33871648, 37253956,189765457,19378102,20851820,25145561
...: ]
In [4]: pop_df = pd.DataFrame(populations,index=index,columns=["Data"])
In [5]: pop_df
Out[5]:
Data
(California, 2000) 33871648
(California, 2010) 37253956
(New York, 2000) 189765457
(New York, 2010) 19378102
(Texas, 2000) 20851820
(Texas, 2010) 25145561
How can one index into this dataframe to get all of the California data? I tried the following and got a key error:
pop_df[('California',)]
So then I executed the following and still got a key error:
In [6]: index2 = pd.MultiIndex.from_tuples(index)
In [7]: pop_df2 = pop_df.reindex(index2)
In [8]: pop_df2
Out[8]:
Data
California 2000 33871648
2010 37253956
New York 2000 189765457
2010 19378102
Texas 2000 20851820
2010 25145561
In [9]: pop_df2['California']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/opt/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'California'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-141-18a1a54664b0> in <module>
----> 1 pop_df2['California']
~/opt/miniconda3/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
3022 if self.columns.nlevels > 1:
3023 return self._getitem_multilevel(key)
-> 3024 indexer = self.columns.get_loc(key)
3025 if is_integer(indexer):
3026 indexer = [indexer]
~/opt/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
-> 3082 raise KeyError(key) from err
3083
3084 if tolerance is not None:
KeyError: 'California'
What is the right way to index into a multiindex dataframe?
df['somename'] looks for columns; df.loc['somename'] looks in the index. You want:
pop_df2.loc['California']
Output:
Data
2000 33871648
2010 37253956
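To make the difference between the two lookups concrete, here is a minimal self-contained sketch. It builds the MultiIndex directly instead of reindexing, and the level names "state" and "year" are assumptions added for readability; they do not appear in the question.

```python
import pandas as pd

# Same data as in the question, with the MultiIndex built directly.
# The level names "state" and "year" are an assumption for illustration.
index = pd.MultiIndex.from_tuples(
    [
        ("California", 2000), ("California", 2010),
        ("New York", 2000), ("New York", 2010),
        ("Texas", 2000), ("Texas", 2010),
    ],
    names=["state", "year"],
)
populations = [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]
df = pd.DataFrame({"Data": populations}, index=index)

# Square brackets select columns, so a column name works...
data_column = df["Data"]

# ...while an index label in square brackets raises KeyError.
try:
    df["California"]
    raised = False
except KeyError:
    raised = True

# .loc selects by index label. A first-level label returns the rows
# under it, with that level dropped from the result.
california = df.loc["California"]

# A full tuple selects a single row, returned as a Series.
california_2010 = df.loc[("California", 2010)]
```

Passing a tuple to .loc is the usual way to go one level deeper once the first-level lookup works.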
You also have the xs option, which allows slicing on a different level while keeping the full index hierarchy:
# `drop_level` defaults to True,
# which behaves like `.loc` on the top level
pop_df2.xs('California', level=0, drop_level=False)
Output:
Data
California 2000 33871648
2010 37253956
Or xs on the second level:
pop_df2.xs(2010, level=1, drop_level=False)
gives you:
Data
California 2010 37253956
New York 2010 19378102
Texas 2010 25145561
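As a rough sketch of how drop_level changes the result, the example below runs xs both ways on the same frame. It assumes the index levels are named "state" and "year" (names not present in the question), which also lets level be given by name rather than by position.

```python
import pandas as pd

# Level names "state" and "year" are an assumption for illustration.
index = pd.MultiIndex.from_tuples(
    [
        ("California", 2000), ("California", 2010),
        ("New York", 2000), ("New York", 2010),
        ("Texas", 2000), ("Texas", 2010),
    ],
    names=["state", "year"],
)
populations = [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]
df = pd.DataFrame({"Data": populations}, index=index)

# Default drop_level=True: the "year" level is removed from the result,
# leaving a plain index of states.
dropped = df.xs(2010, level="year")

# drop_level=False: both levels are kept in the result.
kept = df.xs(2010, level="year", drop_level=False)

print(dropped)
print(kept)
```

Keeping the level is mostly useful when the result will be concatenated or joined back against other slices of the same frame.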
You want .loc[]. Without it, you are looking for a column named 'California', not an index label.
By the way, you had a typo in your input where you were duplicating an index entry. Here is the full code.
In [1]: import pandas as pd
...: index = [
...: ('California',2000),
...: ('California', 2010),
...: ('New York', 2000),
...: ('New York', 2010),
...: ('Texas', 2000),
...: ('Texas',2010)
...: ]
...: populations = [33871648, 37253956,189765457,19378102,20851820,25145561]
...: pop_df = pd.DataFrame(populations,index=index,columns=["Data"])
...: index2 = pd.MultiIndex.from_tuples(index)
...: pop_df2 = pop_df.reindex(index2)
...: pop_df2.loc['California']
Out[1]:
Data
2000 33871648
2010 37253956
Try with IndexSlice:
pop_df2.loc[pd.IndexSlice[['California'],],]
Out[52]:
Data
California 2000 33871648
2010 37253956
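For comparison, a more commonly seen spelling of the same idea passes an explicit slice for every level and for the columns. This is a sketch, not the answer's exact code: the level names are assumptions, and sort_index() is called first because slicing a MultiIndex generally expects a sorted index.

```python
import pandas as pd

# Level names "state" and "year" are an assumption for illustration.
index = pd.MultiIndex.from_tuples(
    [
        ("California", 2000), ("California", 2010),
        ("New York", 2000), ("New York", 2010),
        ("Texas", 2000), ("Texas", 2010),
    ],
    names=["state", "year"],
)
populations = [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]
df = pd.DataFrame({"Data": populations}, index=index).sort_index()

idx = pd.IndexSlice

# All years for California; passing a list keeps the "state" level.
california = df.loc[idx[["California"], :], :]

# Year 2010 across all states.
year_2010 = df.loc[idx[:, 2010], :]

# Year 2000 for two chosen states.
two_states_2000 = df.loc[idx[["California", "Texas"], 2000], :]
```

The trailing `:` after the comma selects all columns; spelling it out makes it clear which part of the key applies to rows and which to columns.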
Here is a solution. You need to indicate the level:
pop_df2[pop_df2.index.get_level_values(0) == 'California']
#Output:
Data
California 2000 33871648
2010 37253956
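The boolean-mask approach also extends to conditions on several levels at once. The sketch below assumes the levels are named "state" and "year" (not in the question), so get_level_values can take a name instead of a position.

```python
import pandas as pd

# Level names "state" and "year" are an assumption for illustration.
index = pd.MultiIndex.from_tuples(
    [
        ("California", 2000), ("California", 2010),
        ("New York", 2000), ("New York", 2010),
        ("Texas", 2000), ("Texas", 2010),
    ],
    names=["state", "year"],
)
populations = [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]
df = pd.DataFrame({"Data": populations}, index=index)

# One flat Index of labels per level, aligned with the rows of df.
states = df.index.get_level_values("state")
years = df.index.get_level_values("year")

# Single condition, equivalent to the positional form get_level_values(0).
california = df[states == "California"]

# Conditions on both levels combined with & (parentheses are required).
not_texas_2010 = df[(states != "Texas") & (years == 2010)]

# Membership test against several labels.
ca_or_ny = df[states.isin(["California", "New York"])]
```

Unlike .loc with a first-level label, a mask never drops an index level, so the result keeps the full hierarchy.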