How do I use the groupby function on a DataFrame in Python?
Input DataFrame:
Last Updated Downloads Category
0 2018 10000 ART_AND_DESIGN
1 2018 500000 ART_AND_DESIGN
2 2018 5000000 ART_AND_DESIGN
3 2018 50000000 ART_AND_DESIGN
4 2018 100000 ART_AND_DESIGN
... ... ...
10838 2017 1000 MEDICAL
10839 2015 1000 BOOKS_AND_REFERENCE
10840 2018 10000000 LIFESTYLE
The problem statement is: "For the years 2016, 2017, and 2018, which app categories have the most and the fewest downloads?"
To solve this, I used:
df1 = df_year_d.groupby(['Last Updated','Category']).sum()
print(df1)
Downloads
Last Updated Category
2010 FAMILY 100000
2011 BOOKS_AND_REFERENCE 1000000
BUSINESS 1000
FAMILY 50000
GAME 10100000
LIBRARIES_AND_DEMO 1000000
LIFESTYLE 100000
TOOLS 5156100
2012 BUSINESS 10000
COMMUNICATION 1000
FAMILY 711210
FINANCE 100000
GAME 1050000
HEALTH_AND_FITNESS 1100000
LIBRARIES_AND_DEMO 10000000
MEDICAL 120000
PHOTOGRAPHY 500000
PRODUCTIVITY 100000
SHOPPING 100000
TOOLS 200000
2013 BOOKS_AND_REFERENCE 2000
BUSINESS 10300
COMMUNICATION 151000
EDUCATION 50000
FAMILY 50338310
FINANCE 60100
GAME 40265250
HEALTH_AND_FITNESS 10000
HOUSE_AND_HOME 100000
LIBRARIES_AND_DEMO 6000000
...
2018 BOOKS_AND_REFERENCE 1880913110
BUSINESS 975227003
COMICS 55201050
COMMUNICATION 32548874886
DATING 262259557
EDUCATION 842800000
ENTERTAINMENT 2836150000
EVENTS 15410330
FAMILY 9020112207
FINANCE 872763824
FOOD_AND_DRINK 271663081
GAME 33052192901
HEALTH_AND_FITNESS 1568697276
HOUSE_AND_HOME 161847101
LIBRARIES_AND_DEMO 16283100
LIFESTYLE 468085968
MAPS_AND_NAVIGATION 702264990
MEDICAL 50556517
NEWS_AND_MAGAZINES 7491323670
PARENTING 31140010
PERSONALIZATION 2130701875
PHOTOGRAPHY 9402062515
PRODUCTIVITY 13963101723
SHOPPING 3243802640
SOCIAL 13924137461
SPORTS 1540744703
TOOLS 10633528879
TRAVEL_AND_LOCAL 6846181981
VIDEO_PLAYERS 5928936510
WEATHER 407227020
[188 rows x 1 columns]
Now I need the Category with the Max and the Min downloads for each of the three years 2016, 2017, and 2018. Please suggest an efficient way to do this in Python.
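First filter with Series.isin and boolean indexing so that only the necessary rows are processed (the fewer rows processed, the better the performance).
Because you need the Category for each year, aggregate with sum and pass as_index=False so that the group keys stay as ordinary DataFrame columns. Then use DataFrameGroupBy.idxmin and DataFrameGroupBy.idxmax to get the index labels of the minimum and maximum Downloads within each year, and pass those labels to DataFrame.loc to select the rows. DataFrame.stack reshapes the two columns of labels (idxmin, idxmax) into a single Series: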
# Keep only the years of interest
df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
# Aggregate the filtered frame (df1), not the original df_year_d
df1 = df1.groupby(['Last Updated','Category'], as_index=False).sum()
# Select the rows holding each year's minimum and maximum Downloads
df1 = df1.loc[df1.groupby('Last Updated')['Downloads'].agg(['idxmin','idxmax']).stack()]
print(df1)
Last Updated Category Downloads
7 2016 BUSINESS 10
6 2016 BOOKS_AND_REFERENCE 10000
14 2017 PRODUCTIVITY 10
9 2017 ART_AND_DESIGN 2660000
16 2018 AUTO_AND_VEHICLES 100
15 2018 ART_AND_DESIGN 84345000
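The steps above can be tried end to end on a small frame. This is only a minimal sketch: the data, the category names A/B/C, and the variable names recent, totals, labels, and result are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
df = pd.DataFrame({
    "Last Updated": [2015, 2016, 2016, 2016, 2017, 2017, 2018, 2018, 2018],
    "Category": ["A", "A", "B", "C", "A", "B", "A", "B", "C"],
    "Downloads": [5, 40, 70, 55, 80, 20, 55, 45, 240],
})

# 1. Keep only the years of interest (the 2015 row is dropped).
recent = df[df["Last Updated"].isin([2016, 2017, 2018])]

# 2. Total downloads per (year, category); keys stay as ordinary columns.
totals = recent.groupby(["Last Updated", "Category"], as_index=False)["Downloads"].sum()

# 3. Row labels of the smallest and largest total within each year,
#    flattened into one Series: (year, 'idxmin'/'idxmax') -> label.
labels = totals.groupby("Last Updated")["Downloads"].agg(["idxmin", "idxmax"]).stack()

# 4. Select those rows: for each year, the minimum first, then the maximum.
result = totals.loc[labels].reset_index(drop=True)
print(result)
```

Each year contributes two rows, so the min and max can be told apart by position within the year.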
Another idea is to sort with DataFrame.sort_values and then use GroupBy.nth to take the first and last rows of each group:
df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
df1 = df1.groupby(['Last Updated','Category'], as_index=False).sum()
df1 = (df1.sort_values(['Last Updated','Downloads'])
.groupby('Last Updated', as_index=False)
.nth([0,-1]))
print(df1)
Last Updated Category Downloads
7 2016 BUSINESS 10
8 2016 FINANCE 10000
14 2017 PRODUCTIVITY 10
9 2017 ART_AND_DESIGN 2660000
16 2018 AUTO_AND_VEHICLES 100
15 2018 ART_AND_DESIGN 84345000
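The two outputs differ in the 2016 maximum row (BOOKS_AND_REFERENCE versus FINANCE, both with 10000). This looks like a tie: idxmax returns the first row with the maximum, while sorting and taking the last row returns the tied row that sorts last. A minimal sketch of that behaviour, using hypothetical totals that reuse the category names from the output:

```python
import pandas as pd

# Hypothetical per-category totals with a tie for the 2016 maximum.
totals = pd.DataFrame({
    "Last Updated": [2016, 2016, 2016],
    "Category": ["BUSINESS", "BOOKS_AND_REFERENCE", "FINANCE"],
    "Downloads": [10, 10000, 10000],
})

# idxmax returns the label of the FIRST row holding the maximum.
by_idx = totals.loc[totals.groupby("Last Updated")["Downloads"].idxmax()]

# After sorting, tied rows keep their relative order, so the last row
# of the group is the LAST of the tied rows.
by_nth = (totals.sort_values(["Last Updated", "Downloads"])
                .groupby("Last Updated", as_index=False)
                .nth(-1))

print(by_idx["Category"].tolist())
print(by_nth["Category"].tolist())
```

If ties matter, neither approach reports all tied categories; comparing each total against the per-year min and max would be one way to keep them all.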
Use SQLAlchemy to load the DataFrame into a database table, then use a SQL query to get the result.
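from sqlalchemy import create_engine
# Shuffle the rows and reset the index
df = df.sample(frac=1).reset_index(drop=True)
# In-memory SQLite database
engine = create_engine('sqlite://', echo=False)
# Write the DataFrame to a table
sql = df.to_sql('Table_Name', con=engine)
query = ""  # the SQL query goes here
# Note: Engine.execute() was removed in SQLAlchemy 2.0; on 2.x, run the query through engine.connect() instead
engine.execute(query).fetchall()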
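The query itself is left empty above. As one possible illustration, here is a sketch of a query that returns the min and max category per year. It swaps in the standard-library sqlite3 module for the SQLAlchemy engine, and the table name apps, the aliases year, lo, and hi, and the sample data are all assumptions made for the example.

```python
import sqlite3
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "Last Updated": [2016, 2016, 2016, 2017, 2017, 2018, 2018],
    "Category": ["CAT-A", "CAT-B", "CAT-A", "CAT-A", "CAT-B", "CAT-B", "CAT-C"],
    "Downloads": [10, 30, 25, 25, 70, 20, 90],
})

conn = sqlite3.connect(":memory:")       # throwaway in-memory database
df.to_sql("apps", conn, index=False)     # "apps" is an assumed table name

query = """
WITH totals AS (
    SELECT "Last Updated" AS year, Category, SUM(Downloads) AS downloads
    FROM apps
    WHERE "Last Updated" IN (2016, 2017, 2018)
    GROUP BY "Last Updated", Category
)
SELECT t.year, t.Category, t.downloads
FROM totals AS t
JOIN (
    SELECT year, MIN(downloads) AS lo, MAX(downloads) AS hi
    FROM totals
    GROUP BY year
) AS e ON e.year = t.year
WHERE t.downloads IN (e.lo, e.hi)
ORDER BY t.year, t.downloads
"""

# Each year yields its lowest total first, then its highest.
result = pd.read_sql_query(query, conn)
conn.close()
print(result)
```

Unlike the idxmin/idxmax approach, this query keeps every category tied for the minimum or maximum.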
Let's work through an example.
# Sample dataset --> df
Last Updated Downloads Category
0 2016 10 CAT-A
1 2016 20 CAT-A
2 2016 10 CAT-A
3 2016 20 CAT-B
4 2016 30 CAT-B
5 2016 20 CAT-B
6 2016 35 CAT-C
7 2016 20 CAT-C
8 2017 25 CAT-A
9 2017 25 CAT-A
10 2017 30 CAT-A
11 2017 70 CAT-B
12 2017 70 CAT-B
13 2017 80 CAT-B
14 2017 10 CAT-C
15 2017 10 CAT-C
16 2018 15 CAT-A
17 2018 25 CAT-A
18 2018 15 CAT-A
19 2018 20 CAT-B
20 2018 15 CAT-B
21 2018 10 CAT-B
22 2018 90 CAT-C
23 2018 150 CAT-C
# Filter the dataframe on years and sum all entries category-wise
# (between() includes both endpoints by default; inclusive=True is no longer accepted by pandas 2.x)
my_filtered_df = df[df['Last Updated'].between(2016, 2018)].groupby(['Last Updated','Category']).sum()
Downloads
Last Updated Category
2016 CAT-A 40
CAT-B 70
CAT-C 55
2017 CAT-A 80
CAT-B 220
CAT-C 20
2018 CAT-A 55
CAT-B 45
CAT-C 240
min_downloads = my_filtered_df.loc[my_filtered_df.groupby("Last Updated").Downloads.idxmin()].reset_index()
Last Updated Category Downloads
0 2016 CAT-A 40
1 2017 CAT-C 20
2 2018 CAT-B 45
max_downloads = my_filtered_df.loc[my_filtered_df.groupby("Last Updated").Downloads.idxmax()].reset_index()
Last Updated Category Downloads
0 2016 CAT-B 70
1 2017 CAT-B 220
2 2018 CAT-C 240
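If a single table is preferred, the two results can be placed side by side, one row per year. A minimal sketch, assuming a frame shaped like my_filtered_df; the names totals, min_rows, max_rows, and summary, and the _min/_max suffixes, are chosen here for illustration:

```python
import pandas as pd

# Rebuild the aggregated frame from the example (same shape as my_filtered_df).
totals = pd.DataFrame(
    {"Downloads": [40, 70, 55, 80, 220, 20, 55, 45, 240]},
    index=pd.MultiIndex.from_product(
        [[2016, 2017, 2018], ["CAT-A", "CAT-B", "CAT-C"]],
        names=["Last Updated", "Category"],
    ),
)

grouped = totals.groupby("Last Updated")["Downloads"]

# idxmin/idxmax return (year, category) tuples; select those rows and
# move Category out of the index so that only the year remains as index.
min_rows = totals.loc[grouped.idxmin().tolist()].reset_index(level="Category")
max_rows = totals.loc[grouped.idxmax().tolist()].reset_index(level="Category")

# Join on the year index; suffixes distinguish the min and max columns.
summary = min_rows.join(max_rows, lsuffix="_min", rsuffix="_max")
print(summary)
```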
PS: Thanks to @jezrael for pointing out the flaw in the earlier approach.