How do I use the groupby function on a DataFrame in Python?

Input DataFrame:

      Last Updated  Downloads             Category  
0             2018      10000       ART_AND_DESIGN  
1             2018     500000       ART_AND_DESIGN  
2             2018    5000000       ART_AND_DESIGN  
3             2018   50000000       ART_AND_DESIGN  
4             2018     100000       ART_AND_DESIGN  
           ...        ...                  ...  
10838         2017       1000              MEDICAL  
10839         2015       1000  BOOKS_AND_REFERENCE  
10840         2018   10000000            LIFESTYLE  

The problem statement is: "For the years 2016, 2017 and 2018, which app categories have the most and the fewest downloads?"

To solve this, I used:

df1 = df_year_d.groupby(['Last Updated','Category']).sum()
print(df1)
                                    Downloads
Last Updated Category                        
2010         FAMILY                    100000
2011         BOOKS_AND_REFERENCE      1000000
             BUSINESS                    1000
             FAMILY                     50000
             GAME                    10100000
             LIBRARIES_AND_DEMO       1000000
             LIFESTYLE                 100000
             TOOLS                    5156100
2012         BUSINESS                   10000
             COMMUNICATION               1000
             FAMILY                    711210
             FINANCE                   100000
             GAME                     1050000
             HEALTH_AND_FITNESS       1100000
             LIBRARIES_AND_DEMO      10000000
             MEDICAL                   120000
             PHOTOGRAPHY               500000
             PRODUCTIVITY              100000
             SHOPPING                  100000
             TOOLS                     200000
2013         BOOKS_AND_REFERENCE         2000
             BUSINESS                   10300
             COMMUNICATION             151000
             EDUCATION                  50000
             FAMILY                  50338310
             FINANCE                    60100
             GAME                    40265250
             HEALTH_AND_FITNESS         10000
             HOUSE_AND_HOME            100000
             LIBRARIES_AND_DEMO       6000000
                                      ...
2018         BOOKS_AND_REFERENCE   1880913110
             BUSINESS               975227003
             COMICS                  55201050
             COMMUNICATION        32548874886
             DATING                 262259557
             EDUCATION              842800000
             ENTERTAINMENT         2836150000
             EVENTS                  15410330
             FAMILY                9020112207
             FINANCE                872763824
             FOOD_AND_DRINK         271663081
             GAME                 33052192901
             HEALTH_AND_FITNESS    1568697276
             HOUSE_AND_HOME         161847101
             LIBRARIES_AND_DEMO      16283100
             LIFESTYLE              468085968
             MAPS_AND_NAVIGATION    702264990
             MEDICAL                 50556517
             NEWS_AND_MAGAZINES    7491323670
             PARENTING               31140010
             PERSONALIZATION       2130701875
             PHOTOGRAPHY           9402062515
             PRODUCTIVITY         13963101723
             SHOPPING              3243802640
             SOCIAL               13924137461
             SPORTS                1540744703
             TOOLS                10633528879
             TRAVEL_AND_LOCAL      6846181981
             VIDEO_PLAYERS         5928936510
             WEATHER                407227020

[188 rows x 1 columns]

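Now I need the Category with the maximum and the minimum downloads for each of the three years 2016, 2017 and 2018 separately. Please suggest an efficient way to solve this query in Python.

First filter with Series.isin and boolean indexing so that only the necessary rows are processed (the fewer rows processed, the better the performance).

Because you need the Category for each year, first aggregate with sum, passing as_index=False so the result is a DataFrame with regular columns. Then use DataFrameGroupBy.idxmin and DataFrameGroupBy.idxmax to get the index labels of the minimum and maximum values, which DataFrame.loc can use for selection. DataFrame.stack reshapes the two label columns into a single Series: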

# Keep only the rows for the three years of interest
df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
# Aggregate the filtered frame (df1), not the original df_year_d
df1 = df1.groupby(['Last Updated','Category'], as_index=False).sum()

# Select the rows holding the minimum and maximum Downloads of each year
df1 = df1.loc[df1.groupby('Last Updated')['Downloads'].agg(['idxmin','idxmax']).stack()]

print(df1)
    Last Updated             Category  Downloads
7           2016             BUSINESS         10
6           2016  BOOKS_AND_REFERENCE      10000
14          2017         PRODUCTIVITY         10
9           2017       ART_AND_DESIGN    2660000
16          2018    AUTO_AND_VEHICLES        100
15          2018       ART_AND_DESIGN   84345000
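
To see the idxmin/idxmax approach run end to end, here is a minimal sketch on made-up data. The rows and values are hypothetical; only the column names and the frame name df_year_d come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
df_year_d = pd.DataFrame({
    "Last Updated": [2015, 2016, 2016, 2016, 2017, 2017, 2018, 2018, 2018],
    "Category": ["TOOLS", "GAME", "TOOLS", "GAME", "GAME",
                 "TOOLS", "GAME", "TOOLS", "FAMILY"],
    "Downloads": [5, 100, 10, 50, 20, 300, 7, 70, 700],
})

# Keep only the years of interest, then total downloads per (year, category).
totals = (
    df_year_d[df_year_d["Last Updated"].isin([2016, 2017, 2018])]
    .groupby(["Last Updated", "Category"], as_index=False)["Downloads"]
    .sum()
)

# Row labels of the smallest and largest total within each year.
labels = totals.groupby("Last Updated")["Downloads"].agg(["idxmin", "idxmax"])

# stack() flattens the two label columns into one Series, year by year.
result = totals.loc[labels.stack()]
print(result)
```

Each year contributes two rows: its minimum first, then its maximum. If several categories tie, idxmin and idxmax return the first matching row.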

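Another idea is to sort with DataFrame.sort_values and take the first and last row of each group with GroupBy.nth. When several categories tie, this can select a different row than idxmin/idxmax, as the 2016 maximum below shows: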

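# Keep only the rows for the three years of interest
df1 = df_year_d[df_year_d['Last Updated'].isin([2016,2017,2018])]
# Aggregate the filtered frame (df1), not the original df_year_d
df1 = df1.groupby(['Last Updated','Category'], as_index=False).sum()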

df1 = (df1.sort_values(['Last Updated','Downloads'])
          .groupby('Last Updated', as_index=False)
          .nth([0,-1]))
print(df1)
    Last Updated           Category  Downloads
7           2016           BUSINESS         10
8           2016            FINANCE      10000
14          2017       PRODUCTIVITY         10
9           2017     ART_AND_DESIGN    2660000
16          2018  AUTO_AND_VEHICLES        100
15          2018     ART_AND_DESIGN   84345000
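
The sort-then-nth idea can be tried in isolation on a few already aggregated rows. This is only a sketch: the totals below are invented for illustration and the variable names are not from the question.

```python
import pandas as pd

# Hypothetical per-year, per-category totals (already aggregated).
totals = pd.DataFrame({
    "Last Updated": [2016, 2016, 2016, 2017, 2017, 2018, 2018, 2018],
    "Category": ["GAME", "TOOLS", "FAMILY", "GAME", "TOOLS",
                 "GAME", "TOOLS", "FAMILY"],
    "Downloads": [150, 10, 60, 20, 300, 7, 70, 700],
})

# After sorting, the first row of each year is its minimum
# and the last row is its maximum.
ordered = totals.sort_values(["Last Updated", "Downloads"])
extremes = ordered.groupby("Last Updated", as_index=False).nth([0, -1])
print(extremes)
```

A year with a single category yields only one row here, because the first and the last row are the same row.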

Use SQLAlchemy to load the DataFrame into a database table, then use a SQL query to get the result.

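from sqlalchemy import create_engine, text

df = df.sample(frac=1).reset_index(drop=True)     # optional: shuffle the rows
engine = create_engine('sqlite://', echo=False)   # in-memory SQLite database
df.to_sql('Table_Name', con=engine, index=False)  # write the DataFrame as a table
query = ""                                        # put your SQL query here
with engine.connect() as conn:                    # engine.execute() was removed in SQLAlchemy 2.0
    rows = conn.execute(text(query)).fetchall()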
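
The answer leaves the query string empty. One possible query is sketched below. It is an assumption, not the author's query. To stay dependency-free the sketch passes a standard-library sqlite3 connection to pandas instead of a SQLAlchemy engine; the table name apps and the toy rows are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data; only the column names follow the question.
df = pd.DataFrame({
    "Last Updated": [2015, 2016, 2016, 2016, 2017, 2017, 2018, 2018, 2018],
    "Category": ["TOOLS", "GAME", "TOOLS", "GAME", "GAME",
                 "TOOLS", "GAME", "TOOLS", "FAMILY"],
    "Downloads": [5, 100, 10, 50, 20, 300, 7, 70, 700],
})

# Sum per (year, category), find each year's smallest and largest total,
# then keep the categories that hit either bound.
query = """
WITH totals AS (
    SELECT "Last Updated" AS year, Category, SUM(Downloads) AS total
    FROM apps
    WHERE "Last Updated" IN (2016, 2017, 2018)
    GROUP BY "Last Updated", Category
),
bounds AS (
    SELECT year, MIN(total) AS lo, MAX(total) AS hi
    FROM totals
    GROUP BY year
)
SELECT t.year, t.Category, t.total
FROM totals AS t
JOIN bounds AS b ON b.year = t.year
WHERE t.total IN (b.lo, b.hi)
ORDER BY t.year, t.total
"""

conn = sqlite3.connect(":memory:")    # in-memory database
df.to_sql("apps", conn, index=False)  # "apps" is a made-up table name
result = pd.read_sql_query(query, conn)
conn.close()
print(result)
```

Unlike idxmin/idxmax, this query returns every category that ties for a year's minimum or maximum.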

Let's understand this through an example.


# Sample dataset --> df

      Last Updated  Downloads Category
0           2016         10    CAT-A
1           2016         20    CAT-A
2           2016         10    CAT-A
3           2016         20    CAT-B
4           2016         30    CAT-B
5           2016         20    CAT-B
6           2016         35    CAT-C
7           2016         20    CAT-C
8           2017         25    CAT-A
9           2017         25    CAT-A
10          2017         30    CAT-A
11          2017         70    CAT-B
12          2017         70    CAT-B
13          2017         80    CAT-B
14          2017         10    CAT-C
15          2017         10    CAT-C
16          2018         15    CAT-A
17          2018         25    CAT-A
18          2018         15    CAT-A
19          2018         20    CAT-B
20          2018         15    CAT-B
21          2018         10    CAT-B
22          2018         90    CAT-C
23          2018        150    CAT-C

# Filter the DataFrame to 2016-2018 (between() includes both endpoints by default),
# then sum the downloads per year and category
my_filtered_df = df[df['Last Updated'].between(2016, 2018)].groupby(['Last Updated','Category']).sum()

                       Downloads
Last Updated Category           
2016         CAT-A            40
             CAT-B            70
             CAT-C            55
2017         CAT-A            80
             CAT-B           220
             CAT-C            20
2018         CAT-A            55
             CAT-B            45
             CAT-C           240

min_downloads = my_filtered_df.loc[my_filtered_df.groupby("Last Updated").Downloads.idxmin()].reset_index()

     Last Updated Category  Downloads
0          2016    CAT-A         40
1          2017    CAT-C         20
2          2018    CAT-B         45



max_downloads = my_filtered_df.loc[my_filtered_df.groupby("Last Updated").Downloads.idxmax()].reset_index()

     Last Updated Category  Downloads
0          2016    CAT-B         70
1          2017    CAT-B        220
2          2018    CAT-C        240
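
For readers who want to reproduce these tables, the sketch below rebuilds the sample dataset and reports the minimum and maximum side by side. The summary layout and its column names are my own choice, not part of the answer.

```python
import pandas as pd

# The sample dataset above, rebuilt so the snippet runs on its own.
df = pd.DataFrame({
    "Last Updated": [2016] * 8 + [2017] * 8 + [2018] * 8,
    "Downloads": [10, 20, 10, 20, 30, 20, 35, 20,
                  25, 25, 30, 70, 70, 80, 10, 10,
                  15, 25, 15, 20, 15, 10, 90, 150],
    "Category": (["CAT-A"] * 3 + ["CAT-B"] * 3 + ["CAT-C"] * 2) * 3,
})

# Total downloads per (year, category); between() includes both endpoints.
totals = (
    df[df["Last Updated"].between(2016, 2018)]
    .groupby(["Last Updated", "Category"])["Downloads"]
    .sum()
)

# idxmin/idxmax return one (year, category) label per year;
# element 1 of each label is the category.
by_year = totals.groupby(level="Last Updated")
summary = pd.DataFrame({
    "Min Category": by_year.idxmin().map(lambda label: label[1]),
    "Min Downloads": by_year.min(),
    "Max Category": by_year.idxmax().map(lambda label: label[1]),
    "Max Downloads": by_year.max(),
})
print(summary)
```

The result has one row per year, which avoids keeping separate min_downloads and max_downloads frames.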

PS: Thanks to @jezrael for pointing out the flaw in my earlier approach.
