Pandas Groupby Multiple Columns - Top N.

Question

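I have an interesting one! I tried to find a duplicate question but had no luck...

My DataFrame contains all US states and territories for 2013-2016, with several attributes.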

>>> df.head(2)
     state  enrollees  utilizing  enrol_age65  util_age65  year
1  Alabama     637247     635431       473376      474334  2013
2   Alaska      30486      28514        21721       20457  2013

>>> df.tail(2)
     state               enrollees  utilizing  enrol_age65  util_age65  year
214  Puerto Rico          581861     579514       453181      450150  2016
215  U.S. Territories      24329      16979        22608       15921  2016

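I would like to group by year and state, and show the top 3 states for each year (by "enrollees" or "utilizing", it doesn't matter which).

Desired output: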

                                       enrollees  utilizing
year state                                                 
2013 California                          3933310    3823455
     New York                            3133980    3002948
     Florida                             2984799    2847574
...
2016 California                          4516216    4365896
     Florida                             4186823    3984756
     New York                            4009829    3874682

So far I have tried the following:

df.groupby(['year','state'])[['enrollees','utilizing']].sum().head(3)

This only returns the first 3 rows of the grouped result:

                 enrollees  utilizing
year state                           
2013 Alabama        637247     635431
     Alaska          30486      28514
     Arizona        707683     683273

I also tried a lambda function:

df.groupby(['year','state'])[['enrollees','utilizing']]\
  .apply(lambda x: np.sum(x)).nlargest(3, 'enrollees')

This returns the 3 largest rows of the grouped result overall, not the top 3 per year:

                 enrollees  utilizing
year state                           
2016 California    4516216    4365896
2015 California    4324304    4191704
2014 California    4133532    4011208

I think this may have something to do with the index of the GroupBy object, but I'm not sure... any guidance would be greatly appreciated!

Answer 1

Well, you can do something not so pretty.

First, get the list of unique years using set():

years_list = sorted(set(df.year))

Create an empty DataFrame and a concatenation helper function that I have used in the past:

import pandas as pd

def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    Append df_temp to df_full and return the combined DataFrame.

    Saves retyping the same concat line for every DataFrame. df_temp is the
    temporary DataFrame created on each loop iteration; df_full is the
    accumulated DataFrame, which must first be initialized (outside the loop)
    as df_name = pd.DataFrame().
    """
    # On the first iteration the accumulator is empty, so just take df_temp.
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)

    return df_full

Create the empty final DataFrame:

df_final = pd.DataFrame()

Now loop over each year and add its result to the new DataFrame:

for year in years_list:
    # query() filters the rows; the @ prefix refers to an external Python
    # variable, in this case the loop variable `year`.
    df2 = df.query("year == @year")

    # Temporary DF for this year only: aggregate, sort by enrollees, keep the top 3.
    df_temp = (
        df2.groupby(['year', 'state'])[['enrollees', 'utilizing']]
        .sum()
        .sort_values(by="enrollees", ascending=False)
        .head(3)
    )
    # Finally, call the helper, which keeps concatenating the temporary DFs.
    df_final = concatenate_loop_dfs(df_temp, df_final)

And done.

print(df_final)
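
As a possible loop-free variant of the same idea, the per-year top 3 can also be taken with a second groupby on the `year` index level plus `nlargest`. This is only a sketch: the small `df` below is made-up sample data with the question's column names, and the names `totals` and `top3` are introduced here for illustration.

```python
import pandas as pd

# Made-up sample data with the same columns as in the question (values are illustrative).
df = pd.DataFrame({
    "year":      [2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014],
    "state":     ["A", "B", "C", "D", "A", "B", "C", "D"],
    "enrollees": [10, 40, 30, 20, 15, 5, 50, 25],
    "utilizing": [9, 38, 29, 18, 14, 4, 47, 22],
})

# Aggregate per (year, state), as in the loop above.
totals = df.groupby(["year", "state"])[["enrollees", "utilizing"]].sum()

# Within each year, keep the 3 rows with the largest "enrollees".
# group_keys=False stops pandas from adding "year" to the index a second time.
top3 = totals.groupby(level="year", group_keys=False).apply(
    lambda g: g.nlargest(3, "enrollees")
)
print(top3)
```

The result keeps the (year, state) MultiIndex from the desired output, so no helper function or accumulator DataFrame is needed.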

Answer 2

Then you need to sort the grouped result: .sort_values('enrollees', ascending=False)
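
Sorting alone still leaves all rows in place, so one way to finish the sort_values suggestion is to sort and then take the first 3 rows of each year with `groupby(...).head(3)`. The following is a minimal sketch on made-up sample data; `totals` and `top3` are illustrative names, not part of the original answer.

```python
import pandas as pd

# Made-up sample data with the same columns as in the question (values are illustrative).
df = pd.DataFrame({
    "year":      [2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014],
    "state":     ["A", "B", "C", "D", "A", "B", "C", "D"],
    "enrollees": [10, 40, 30, 20, 15, 5, 50, 25],
    "utilizing": [9, 38, 29, 18, 14, 4, 47, 22],
})

# Aggregate per (year, state), then turn the index back into ordinary columns.
totals = (
    df.groupby(["year", "state"])[["enrollees", "utilizing"]]
    .sum()
    .reset_index()
)

# Sort years ascending and enrollees descending, then keep the first
# 3 rows of each year; head() preserves the sorted row order.
top3 = (
    totals.sort_values(["year", "enrollees"], ascending=[True, False])
    .groupby("year")
    .head(3)
    .set_index(["year", "state"])
)
print(top3)
```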
