
How to perform a groupby and select unique in pandas?

I have a flights dataset containing "UNIQUE_CARRIER_NAME", "MONTH_YEAR", "ROUTE" and other attributes, such as passenger count, that are not relevant here. Here is a sample (the full data has many other carriers, and the dates run up to 2017):

           UNIQUE_CARRIER_NAME MONTH_YEAR    ROUTE
2512    ATA Airlines d/b/a ATA     2-1990  OGG-HNL
2648    ATA Airlines d/b/a ATA     2-1990  IND-RSW
2649    ATA Airlines d/b/a ATA     2-1990  IND-RSW
2650    ATA Airlines d/b/a ATA     2-1990  IND-RSW
3104    ATA Airlines d/b/a ATA     2-1990  HNL-SFO
3470    ATA Airlines d/b/a ATA     2-1990  SFO-HNL
3482    ATA Airlines d/b/a ATA     2-1990  SFO-OGG
4522    ATA Airlines d/b/a ATA     3-1990  OGG-HNL
5076    ATA Airlines d/b/a ATA     2-1990  RSW-IND
5077    ATA Airlines d/b/a ATA     2-1990  RSW-IND
5078    ATA Airlines d/b/a ATA     2-1990  RSW-IND
5296    ATA Airlines d/b/a ATA     3-1990  RSW-IND
5297    ATA Airlines d/b/a ATA     3-1990  RSW-IND
5371    ATA Airlines d/b/a ATA     3-1990  SFO-HNL
5389    ATA Airlines d/b/a ATA     3-1990  SFO-OGG
....

I want to group by "UNIQUE_CARRIER_NAME", "MONTH_YEAR" and "ROUTE", in that order, in Python. I have written this:

carrier_groups = df.groupby(["UNIQUE_CARRIER_NAME", "MONTH_YEAR", "ROUTE"])

This returns a DataFrameGroupBy object which I can iterate over to perform some calculations on the route data. Is there any way to skip aggregating the rest of the columns and just select the unique routes in this groupby? For example, these 3 rows should be selected as only 1:

2648    ATA Airlines d/b/a ATA     2-1990  IND-RSW
2649    ATA Airlines d/b/a ATA     2-1990  IND-RSW
2650    ATA Airlines d/b/a ATA     2-1990  IND-RSW

I would like to iterate over the DataFrame grouped by "UNIQUE_CARRIER_NAME" and "MONTH_YEAR", such that:

for each group of the DataFrame:
    I have a subset of df on which I can run a function over ROUTE to get some results

No grouping is necessary. Just drop the duplicate rows from the DataFrame using:

df = df.drop_duplicates(subset=['UNIQUE_CARRIER_NAME','MONTH_YEAR','ROUTE'])
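If you still want to loop over each carrier and month afterwards, as the question describes, one possible follow-up is to group the de-duplicated frame by the first two columns. This is only a sketch: the sample data is a small hypothetical frame modeled on the rows in the question, and `key_cols`, `unique_routes` and `routes_per_group` are illustrative names, not something from the original post.

```python
import pandas as pd

# Hypothetical sample modeled on the rows shown in the question.
df = pd.DataFrame({
    "UNIQUE_CARRIER_NAME": ["ATA Airlines d/b/a ATA"] * 6,
    "MONTH_YEAR": ["2-1990", "2-1990", "2-1990", "2-1990", "3-1990", "3-1990"],
    "ROUTE": ["OGG-HNL", "IND-RSW", "IND-RSW", "IND-RSW", "RSW-IND", "RSW-IND"],
})

key_cols = ["UNIQUE_CARRIER_NAME", "MONTH_YEAR", "ROUTE"]

# Keep one row per carrier / month / route combination.
unique_routes = df.drop_duplicates(subset=key_cols)

# Iterate per (carrier, month); each group is a sub-DataFrame of unique routes.
routes_per_group = {}
for (carrier, month_year), group in unique_routes.groupby(
    ["UNIQUE_CARRIER_NAME", "MONTH_YEAR"], sort=False
):
    # Replace this line with whatever calculation you need on group["ROUTE"].
    routes_per_group[(carrier, month_year)] = group["ROUTE"].tolist()

print(routes_per_group)
```

Because the duplicates are dropped before grouping, each `group` already contains every route only once, so no aggregation of the other columns is involved.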

I think you need drop_duplicates first, and then you can apply your function (the one below is only a sample, because the question gives no details about it):

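def func(x):
    # x is a single row (a Series) because apply is called with axis=1
    print(x)
    # sample logic only; replace it with your own function
    x['ROUTE'] = x['ROUTE'] + 'a'
    return x

# keep one row per carrier / month / route combination
df = df.drop_duplicates(['UNIQUE_CARRIER_NAME','MONTH_YEAR','ROUTE'])
# run func on every remaining row
df = df.apply(func, axis=1)
print(df)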
         UNIQUE_CARRIER_NAME MONTH_YEAR     ROUTE
2512  ATA Airlines d/b/a ATA     2-1990  OGG-HNLa
2648  ATA Airlines d/b/a ATA     2-1990  IND-RSWa
3104  ATA Airlines d/b/a ATA     2-1990  HNL-SFOa
3470  ATA Airlines d/b/a ATA     2-1990  SFO-HNLa
3482  ATA Airlines d/b/a ATA     2-1990  SFO-OGGa
4522  ATA Airlines d/b/a ATA     3-1990  OGG-HNLa
5076  ATA Airlines d/b/a ATA     2-1990  RSW-INDa
5296  ATA Airlines d/b/a ATA     3-1990  RSW-INDa
5371  ATA Airlines d/b/a ATA     3-1990  SFO-HNLa
5389  ATA Airlines d/b/a ATA     3-1990  SFO-OGGa
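If only the unique routes per carrier and month are needed, another option (not taken from either answer above) might be to group by the first two columns and call `unique()` or `nunique()` on ROUTE. The sketch below uses a small hypothetical frame modeled on the question's sample; `unique_by_group` and `route_counts` are illustrative names.

```python
import pandas as pd

# Hypothetical sample modeled on the rows shown in the question.
df = pd.DataFrame({
    "UNIQUE_CARRIER_NAME": ["ATA Airlines d/b/a ATA"] * 6,
    "MONTH_YEAR": ["2-1990", "2-1990", "2-1990", "2-1990", "3-1990", "3-1990"],
    "ROUTE": ["OGG-HNL", "IND-RSW", "IND-RSW", "IND-RSW", "RSW-IND", "RSW-IND"],
})

grouped = df.groupby(["UNIQUE_CARRIER_NAME", "MONTH_YEAR"], sort=False)["ROUTE"]

# One array of distinct routes per (carrier, month) pair.
unique_by_group = grouped.unique()

# Number of distinct routes per (carrier, month) pair.
route_counts = grouped.nunique()

for (carrier, month_year), routes in unique_by_group.items():
    print(carrier, month_year, list(routes))
```

This leaves the original DataFrame untouched, which may be convenient if the other columns (passenger count and so on) are still needed later.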
