简体   繁体   中英

Python DataFrame Multiple Column Groupby With Max Value From First Group

Data Frame is:

df=pd.DataFrame([[1,'a','c1',30,'s1','e1'],[1,'b','c1',60,'s1','e1'],[1,'b','c2',40,'s1','e1'],[2,'g','c1',40,'s2','e2'],[2,'g','c3',9,'s1','e1'],[3,'k','c2',20,'s1','e1'],[3,'k','c2',69,'s2','e1'],[3,'k','c1',29,'s1','e1'],[3,'f','c3',99,'s2','e1']], columns = ['Lot','Item','Code','Date','Shelf','Emp'])
     Lot Item Code  Date Shelf Emp
0    1    a   c1    30    s1  e1
1    1    b   c1    60    s1  e1
2    1    b   c2    40    s1  e1
3    2    g   c1    40    s2  e2
4    2    g   c3     9    s1  e1
5    3    k   c2    20    s1  e1
6    3    k   c2    69    s2  e1
7    3    k   c1    29    s1  e1
8    3    f   c3    99    s2  e1

I need to

  1. Group By Lot and then Item
  2. Get the First Item in each Lot
  3. Finally Get the Max Date for each First Item in Each Lot

The OUTPUT should be:

     Lot Item  Code Date Shelf Emp
0    1    a    c1   30   s1    e1
3    2    g    c2   40   s2    e1
6    3    k    c2   69   s2    e1

Thank you !!!

I have done this using a few intermediary steps (there may be a simpler way, but this is how I would do it).

Firstly, group your df by Lot and return the first item for each lot using df.groupby to create a temporary df:

temp_df1 = df.groupby('Lot', as_index=False).first()[['Lot', 'Item']]

   Lot Item
0    1    a
1    2    g
2    3    k

I then used df.merge() to merge the original df onto temp_df to get only the rows from df that contain the first item for each lot:

temp_df2 = df.merge(temp_df1, on=['Lot', 'Item'], how='inner')

     Lot Item Code  Date Shelf Emp
0    1    a   c1    30    s1  e1
1    2    g   c1    40    s2  e2
2    2    g   c3     9    s1  e1
3    3    k   c2    20    s1  e1
4    3    k   c2    69    s2  e1
5    3    k   c1    29    s1  e1

Then you can group by Lot and Item on this data frame to get your desired output

df_out = temp_df2.groupby(['Lot', 'Item'], as_index=False).max()

   Lot Item Code  Date Shelf Emp
0    1    a   c1    30    s1  e1
1    2    g   c3    40    s2  e2
2    3    k   c2    69    s2  e1

Full code:

temp_df1 = df.groupby('Lot', as_index=False).first()[['Lot', 'Item']]
temp_df2 = df.merge(temp_df1, on=['Lot', 'Item'], how='inner')
df_out = temp_df2.groupby(['Lot', 'Item'], as_index=False).max()

Or the below without creating temporary dfs:

df_out = (df.merge(df.groupby('Lot', as_index=False).first()[['Lot', 'Item']],
                   on=['Lot', 'Item'], how='inner')
          .groupby(['Lot', 'Item'], as_index=False).max())

Another method:

df1 = df.groupby(['Lot']).first()['Item'].reset_index()
df1['Date'] = [df['Date'].loc[df['Item'] == x].max() for x in df1['Date']]
df1

Result:

index Lot Item Date
0 1 a 30
1 2 g 40
2 3 k 69

Edit: even better using a "mask" and no temp df

m = df.groupby(['Lot']).first()['Item'].tolist()
df[df.Item.isin(m)].groupby(['Lot', 'Item'], as_index=False).max()
index Lot Item Code Date Shelf Emp
0 1 a c1 30 s1 e1
1 2 g c3 40 s2 e2
2 3 k c2 69 s2 e1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM