How do I select specific columns from a MultiIndex DataFrame?

Question

I'm playing with the Kaggle beer reviews dataset:

https://www.kaggle.com/rdoume/beerreviews

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1504037 entries, 1586613 to 39648
Data columns (total 13 columns):
brewery_id            1504037 non-null int64
brewery_name          1504037 non-null object
review_time           1504037 non-null int64
review_overall        1504037 non-null float64
review_aroma          1504037 non-null float64
review_appearance     1504037 non-null float64
review_profilename    1504037 non-null object
beer_style            1504037 non-null object
review_palate         1504037 non-null float64
review_taste          1504037 non-null float64
beer_name             1504037 non-null object
beer_abv              1504037 non-null float64
beer_beerid           1504037 non-null int64
dtypes: float64(6), int64(3), object(4)
memory usage: 160.6+ MB

I just built a pivot table, which returned the following result:

review_stat_by_beer = df[['beer_name','review_overall','review_aroma','review_appearance','review_palate','review_taste']]\
    .drop_duplicates(['beer_name'])\
    .pivot_table(index="beer_name", aggfunc=("count",'mean','median'))


review_stat_by_beer.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44075 entries, ! (Old Ale) to 葉山ビール (Hayama Beer)
Data columns (total 15 columns):
(review_appearance, count)     44075 non-null int64
(review_appearance, mean)      44075 non-null float64
(review_appearance, median)    44075 non-null float64
(review_aroma, count)          44075 non-null int64
(review_aroma, mean)           44075 non-null float64
(review_aroma, median)         44075 non-null float64
(review_overall, count)        44075 non-null int64
(review_overall, mean)         44075 non-null float64
(review_overall, median)       44075 non-null float64
(review_palate, count)         44075 non-null int64
(review_palate, mean)          44075 non-null float64
(review_palate, median)        44075 non-null float64
(review_taste, count)          44075 non-null int64
(review_taste, mean)           44075 non-null float64
(review_taste, median)         44075 non-null float64
dtypes: float64(10), int64(5)
memory usage: 5.4+ MB

I tried to select these columns:

review_stat_by_beer.(review_appearance, count)  # SyntaxError: invalid syntax

review_stat_by_beer[(review_appearance, count)] #NameError: name 'review_appearance' is not defined

review_stat_by_beer['(review_appearance, count)'] #KeyError: '(review_appearance, count)'

How do I select columns from this pivot table result? My end goal is to do arithmetic between two columns:

(review_overall, mean) minus (review_taste, mean)

Any ideas? Thanks!

Answer 1

There are several options for selecting a specific column from a MultiIndex:

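# Setup
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
# Assigning a list of lists creates a two-level column MultiIndex
df.columns = [['A', 'A', 'B'], ['a', 'b', 'c']]
df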

   A     B
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8

Direct selection,

df[('A', 'a')]

0    0
1    3
2    6
Name: (A, a), dtype: int64

via loc,

df.loc[:, ('A', 'a')]
# or 
# df.loc(axis=1)[('A', 'a')]  

0    0
1    3
2    6
Name: (A, a), dtype: int64

and with xs,

df.xs(('A', 'a'), axis=1)

0    0
1    3
2    6
Name: (A, a), dtype: int64

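The idea in all of these cases is to pass a tuple of strings, one for the first level and one for the second (your column index has 2 levels). The attempts in the question failed because the level labels were not quoted as separate strings inside a tuple. In your case, the selection looks like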

review_stat_by_beer[('review_appearance', 'count')]
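Since each tuple key returns an ordinary Series, the subtraction asked about in the question follows directly. Here is a minimal sketch; the small `review_stats` frame and its beer names are made up for illustration and only mimic the shape of `review_stat_by_beer`:

```python
import pandas as pd

# Hypothetical stand-in for review_stat_by_beer: two-level columns
# (metric, statistic), indexed by beer name.
columns = pd.MultiIndex.from_product(
    [["review_overall", "review_taste"], ["count", "mean"]]
)
review_stats = pd.DataFrame(
    [[1, 4.0, 1, 3.5], [1, 3.0, 1, 4.5]],
    index=["Beer A", "Beer B"],
    columns=columns,
)

# Each tuple key selects one column as a Series, so plain arithmetic works.
diff = review_stats[("review_overall", "mean")] - review_stats[("review_taste", "mean")]
print(diff)
```

The same expression should work on the real pivot table by swapping `review_stats` for `review_stat_by_beer`.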

There are other ways as well, but these are the best ones.
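As an illustration of selecting more than one column at a time, the sketch below reuses the setup frame from this answer. The variable names `under_a` and `level_c` are just for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.columns = [["A", "A", "B"], ["a", "b", "c"]]

# Indexing with only the first-level label returns every column under it,
# as a DataFrame whose columns are the remaining second-level labels.
under_a = df["A"]

# xs with level=1 takes a cross-section on the second level instead,
# keeping the first-level labels as the resulting column names.
level_c = df.xs("c", axis=1, level=1)

print(under_a)
print(level_c)
```

On the pivot table in the question, the same pattern with `level=1` and the label `'mean'` would presumably give all the mean columns at once.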
