简体   繁体   中英

DataFrame filter based on groupby

Here is my simplified example df:

      salesPerson customer measure timeStamp
      --------------------------------------
      A           123      I       12:30
      A           123      II      12:30
      A           123      III     12:30
      B           123      IV      12:35
      C           456      I       14:30
      C           456      II      14:30
      D           456      III     14:15

What I want to do, it to filer the dataframe and in cases when 2 different salesPerson Id's have the same customer number, keep all the rows of the salesPerson whos timeStamp is the earliest. Resulting df in this example would be:

      salesPerson customer measure timeStamp
      --------------------------------------
      A           123      I       12:30
      A           123      II      12:30
      A           123      III     12:30
      D           456      III     14:15

What would be the best/most pythonic way to do it? I thought about using pandas groupby.filter or groupby.transform, but frankly have no idea how to accurately write those.

Bonus points would be for having the deleted rows in a separate deleted_df object.

This one-liner should do the trick:

df[df['salesPerson'].isin(df.iloc[df.groupby(['customer'])['timeStamp'].idxmin(), 'salesPerson'])]

Explanation:

To determine the salespersons to whom we want to filter, first group df by customer and get the index where the minimum timeStamp is found using idxmin :

df.groupby(['customer'])['timeStamp'].idxmin()

Then, pass those index values to iloc , along with the column we want, to get the values from salesPerson we'll use for filtering:

df.iloc[df.groupby(['customer'])['timeStamp'].idxmin(), 'salesPerson']

Finally, pass that result to the Series method isin , and use that to index into df . The result is thus:

0  A  123    I 2017-07-12 12:30:00
1  A  123   II 2017-07-12 12:30:00
2  A  123  III 2017-07-12 12:30:00
6  D  456  III 2017-07-12 14:15:00

To create a second DataFrame with the filtered-out rows, you could pass the index from the filtered df to the original df and exclude those rows. So if we assigned the result above to df1 , we could create a complementary df2 in this manner:

df2 = df[~df.index.isin(df1.index)]

Result:

3  B  123  IV 2017-07-12 12:35:00
4  C  456   I 2017-07-12 14:30:00
5  C  456  II 2017-07-12 14:30:00

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM