Pandas Groupby Multiple Columns - Top N
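I have an interesting one! I tried to find a duplicate question, but had no luck...
My dataframe covers all US states and territories for the years 2013-2016, with several attributes.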
>>> df.head(2)
state enrollees utilizing enrol_age65 util_age65 year
1 Alabama 637247 635431 473376 474334 2013
2 Alaska 30486 28514 21721 20457 2013
>>> df.tail(2)
state enrollees utilizing enrol_age65 util_age65 year
214 Puerto Rico 581861 579514 453181 450150 2016
215 U.S. Territories 24329 16979 22608 15921 2016
I want to group by year and state, and show the top 3 states for each year (by 'enrollees' or 'utilizing', it doesn't matter which).
Desired output:
enrollees utilizing
year state
2013 California 3933310 3823455
New York 3133980 3002948
Florida 2984799 2847574
...
2016 California 4516216 4365896
Florida 4186823 3984756
New York 4009829 3874682
So far I've tried the following:
df.groupby(['year','state'])['enrollees','utilizing'].sum().head(3)
This only yields the first 3 rows of the grouped result:
enrollees utilizing
year state
2013 Alabama 637247 635431
Alaska 30486 28514
Arizona 707683 683273
I also tried a lambda function:
df.groupby(['year','state'])['enrollees','utilizing']\
.apply(lambda x: np.sum(x)).nlargest(3, 'enrollees')
This yields the 3 largest rows of the grouped result overall, not the 3 largest per year:
enrollees utilizing
year state
2016 California 4516216 4365896
2015 California 4324304 4191704
2014 California 4133532 4011208
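I think it might have something to do with the indexing of the GroupBy object, but I'm not sure... any guidance would be appreciated!
Well, you can do something that isn't so pretty.
First, get the list of unique years using set():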
years_list = sorted(set(df.year))  # sorted, because a set has no guaranteed order
Create a dummy dataframe and a concatenation function that I've used in the past:
def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    To avoid retyping the same line of code for every df.
    The parameters should be the temporary df created at each loop and the
    concatenated DF that will contain all values, which must first be
    initialized (outside the loop) as df_name = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)
    return df_full
Create the dummy final df:
df_final = pd.DataFrame()
Now loop over each year and concatenate the result into the new DF:
for year in years_list:
    # query() filters the rows; @year refers to an external variable,
    # in this case the loop variable
    df2 = df.query("year == @year")
    # temporary DF with only this year: aggregate, sort, and keep the top 3
    # (a list of columns is required; a bare tuple of names fails in recent pandas)
    df_temp = df2.groupby(['year','state'])[['enrollees','utilizing']].sum().sort_values(by="enrollees", ascending=False).head(3)
    # finally, call our function, which keeps concatenating the temporary DFs
    df_final = concatenate_loop_dfs(df_temp, df_final)
And done.
print(df_final)
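The same per-year loop can be written without a helper function by collecting the pieces in a list and calling pd.concat once. The following is only a minimal sketch under stated assumptions: the small dataframe is made up for illustration (only the column names match the question), and nlargest is swapped in for the sort-then-head step.

```python
import pandas as pd

# Made-up sample data; only the column names match the question's dataframe.
df = pd.DataFrame({
    "state": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "year": [2013] * 4 + [2016] * 4,
    "enrollees": [10, 40, 30, 20, 15, 5, 45, 35],
    "utilizing": [9, 38, 29, 18, 14, 4, 44, 33],
})

pieces = []
for year in sorted(df["year"].unique()):
    one_year = df[df["year"] == year]
    # Aggregate per (year, state), then keep the 3 largest by enrollees.
    top3 = (
        one_year.groupby(["year", "state"])[["enrollees", "utilizing"]]
        .sum()
        .nlargest(3, "enrollees")
    )
    pieces.append(top3)

# One concat at the end instead of growing a dataframe inside the loop.
df_final = pd.concat(pieces)
print(df_final)
```

Concatenating once at the end avoids copying the accumulated dataframe on every iteration, which matters little for four years of data but more for long loops.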
Then you need to sort the grouped result: .sort_values('enrollees', ascending=False)
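One way to combine sorting with a per-year top 3, without any loop, is to sort the aggregated frame and then take the head of each year group. This is a sketch, not necessarily what the poster had in mind; the sample dataframe is invented for illustration and only reuses the question's column names.

```python
import pandas as pd

# Made-up sample data; only the column names match the question's dataframe.
df = pd.DataFrame({
    "state": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "year": [2013] * 4 + [2016] * 4,
    "enrollees": [10, 40, 30, 20, 15, 5, 45, 35],
    "utilizing": [9, 38, 29, 18, 14, 4, 44, 33],
})

# Sum per (year, state); the result is a DataFrame with a MultiIndex.
totals = df.groupby(["year", "state"])[["enrollees", "utilizing"]].sum()

top3 = (
    # Sort by the 'year' index level ascending, then enrollees descending.
    totals.sort_values(["year", "enrollees"], ascending=[True, False])
    # Group again on the 'year' level and keep the first 3 rows of each group.
    .groupby(level="year")
    .head(3)
)
print(top3)
```

The second groupby is what makes head(3) apply per year; calling head(3) directly on the aggregated frame returns only the first 3 rows overall, as seen in the question.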