
pandas GroupBy: How to GroupBy and Aggregate data to show only the top 3 values of a field by count

This is my first question on Stack Overflow, so I've tried to be as clear and concise as possible. Many thanks in advance for your patience.

Background

I have a dataset of train data with 17 attributes: origin_station_code, origin_station, destination_station_code, destination_station, route_code, start_time, end_time, fleet_number, station_code, station, station_type, platform, sch_arr_time, sch_dep_time, act_arr_time, act_dep_time, and date.

Of these attributes, I am only concerned with: date, origin_station, destination_station, and start_time.

This dataset consists of 61 individual CSV files that were combined into one DataFrame of just over a million rows using the glob function and a loop (a sketch of that step is shown below).
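For reference, a minimal sketch of that combining step; the data/ directory is a hypothetical path:

import glob
import pandas as pd

# Hypothetical location of the 61 CSV files; adjust the pattern as needed.
csv_files = glob.glob('data/*.csv')

# Read each file in a loop and stack them into one DataFrame.
frames = []
for f in csv_files:
    frames.append(pd.read_csv(f))
df = pd.concat(frames, ignore_index=True)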

Each row of the DataFrame represents an individual stop on a train journey. A full route is made up of several stops; for example, the route from Sugar Wave to Attempt Pin consists of 19 stops.

A new attribute called complete_route_name has been created by concatenating the origin_station and destination_station attributes (sketched below); this identifies each full route, of which there are 81 unique entries.
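Presumably something like the following, assuming the stations are joined with the literal ' to ' seen in route names such as 'Sugar Wave to Attempt Pin':

# Assumed construction of the combined route label; the ' to ' separator
# is inferred from route names like 'Sugar Wave to Attempt Pin'.
df['complete_route_name'] = df['origin_station'] + ' to ' + df['destination_station']
df['complete_route_name'].nunique()   # 81 unique routes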

The task

My task is to subset the DataFrame using pandas so that it shows the 3 most popular routes per date. This subset DataFrame should show the date, the complete_route_name, and a count of the number of times that route has taken place each day. The number of unique times a route has run can be determined by applying the nunique method to the start_time attribute (a date/time column).

My current progress

Currently, my GroupBy and Aggregate code is able to show how many times each route ran per day, as follows:

df_grouped = df.groupby(
    ['date', 'complete_route_name']
).agg(
    {
        'start_time': 'nunique'  # count unique departure times (i.e. journeys) per route per day
    }
).reset_index()
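(For reference: the aggregated column keeps the name start_time here. Named aggregation, available since pandas 0.25, would name the result column count directly, matching the desired output below:)

# Equivalent aggregation, but names the result column 'count' directly.
df_grouped = df.groupby(['date', 'complete_route_name']).agg(
    count=('start_time', 'nunique')
).reset_index()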

However, I now want to extend my existing code so that it shows only the top 3 routes by count, per day, e.g.

date           complete_route_name                                   count
2015-08-01     Attempt Pin to Roll Test                              101
               Suit Treatment Turnback to Spiders Toothbrush         93       
               Concourse Village to Port Morris                      87
2015-08-02     Bridge Bottle to Ants Attempt                         119
               North Riverdale to Eastchester                        117
               Wakefield to Kingsbridge                              101

......

2015-09-30     Castleton Corners to Dongan Hills                     121
               Eltingville to Graniteville                           119
               Great Kills to Castleton                              117

Any help with this would be greatly appreciated!

Additional resources

The original dataset and my workbook in its current state are hosted on my GitHub, along with a static version of the workbook, if that is of any use or interest.

Many thanks!

I will continue from where you left off:

# 'route_name' here plays the role of your 'complete_route_name'
df_agg = df.groupby(['date', 'route_name']).agg({'start_time': 'nunique'}).reset_index()

Then I would do the following to get what you asked for:

# within each date, sort routes by descending count
df_sorted_by_group = df_agg.groupby(['date']).apply(
    lambda x: x.sort_values(['start_time'], ascending=False)
).reset_index(drop=True)

Final step

# keep the first 3 (highest-count) rows per date
df_final = df_sorted_by_group.groupby(['date']).head(3)
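The sort and head steps can also be collapsed into a single chain; a sketch with the same logic:

# Sort by date, then by count descending, and keep the first 3 rows per date.
df_final = (
    df_agg.sort_values(['date', 'start_time'], ascending=[True, False])
          .groupby('date')
          .head(3)
)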

Example code

import pandas as pd

routes = {
    'route_name': ['A to B', 'A to B', 'B to C', 'B to C', 'C to D', 'C to D', 'C to D', 'C to D', 'D to E',
                   'A to Z', 'A to Z', 'B to Z', 'B to Z', 'C to Z', 'C to Z', 'C to Z', 'C to Z', 'D to Z'],
    'date': ['01/01/2015', '01/01/2015', '01/01/2015', '01/01/2015', '01/01/2015', '01/01/2015', '01/01/2015', '01/01/2015', '01/01/2015',
             '02/01/2015', '02/01/2015', '02/01/2015', '02/01/2015', '02/01/2015', '02/01/2015', '02/01/2015', '02/01/2015', '02/01/2015'],
    'start_time': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9',
                   'A10', 'A11', 'A12', 'A13', 'A14', 'A15', 'A16', 'A17', 'A18']
}

df = pd.DataFrame(routes)
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df

    route_name  date    start_time
0   A to B  2015-01-01  A1
1   A to B  2015-01-01  A2
2   B to C  2015-01-01  A3
3   B to C  2015-01-01  A4
4   C to D  2015-01-01  A5
5   C to D  2015-01-01  A6
6   C to D  2015-01-01  A7
7   C to D  2015-01-01  A8
8   D to E  2015-01-01  A9
9   A to Z  2015-01-02  A10
10  A to Z  2015-01-02  A11
11  B to Z  2015-01-02  A12
12  B to Z  2015-01-02  A13
13  C to Z  2015-01-02  A14
14  C to Z  2015-01-02  A15
15  C to Z  2015-01-02  A16
16  C to Z  2015-01-02  A17
17  D to Z  2015-01-02  A18

After applying the script above, you get the following results:

 df_final
     date   route_name  start_time
0   2015-01-01  C to D  4
1   2015-01-01  A to B  2
2   2015-01-01  B to C  2
4   2015-01-02  C to Z  4
5   2015-01-02  A to Z  2
6   2015-01-02  B to Z  2

The same pattern generalizes to other column names and cut-offs; for example, taking the top 16 per date from a grouped DataFrame with Date and Count columns:

df_sorted_by_group = df_grouped.groupby(['Date']).apply(
    lambda x: x.sort_values(['Count'], ascending=False)
).reset_index(drop=True)

df_grouped_top16 = df_sorted_by_group.groupby(['Date']).head(16)

Ok, so starting with your working part, I would rewrite it to:

df_grouped = df.groupby(
    ['date', 'complete_route_name'], as_index=False
)['start_time'].nunique()

Next, IIUC, you can do:

# Rank counts within each date in descending order; method='first'
# breaks ties by order of appearance so exactly 3 rows survive per date.
mask = df_grouped.groupby('date')['start_time'].rank(method='first', ascending=False).le(3)
df_grouped.loc[mask]

Note that rank() defaults to ascending order, so ascending=False is needed to put the most popular routes first; with method='dense' instead of 'first', every route tied at the cut-off would be kept.
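For comparison, a one-liner with the same effect using nlargest (a sketch, assuming df_grouped has the columns produced above):

# group_keys=False keeps the original flat index instead of adding a
# 'date' index level; nlargest picks the 3 highest counts per date.
df_top3 = df_grouped.groupby('date', group_keys=False).apply(
    lambda g: g.nlargest(3, 'start_time')
)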
