PANDAS select rows that contain most recent observation of pairing

The Problem:

I am looking to select only the most recent record of price within each pairing of uid and retailer.

The Data:

import pandas as pd
import numpy as np
data = {"uid":{"0":"123","1":"345","2":"678","3":"123","4":"345","5":"123","6":"678","7":"369","8":"890","9":"678"},"retailer":{"0":"GUY","1":"GUY","2":"GUY","3":"GUY","4":"GUY","5":"GAL","6":"GUY","7":"GAL","8":"GAL","9":"GUY"},"upload date":{"0":"11/17/17","1":"11/17/17","2":"11/16/17","3":"11/16/17","4":"11/16/17","5":"11/17/17","6":"11/17/17","7":"11/17/17","8":"11/17/17","9":"11/15/17"},"price":{"0":12.00,"1":1.23, "2":34.00, "3":69.69, "4":13.72, "5":49.98, "6":98.02, "7":1.02,"8":98.23,"9":12.69}}
df = pd.DataFrame(data=data)
df = df[['uid','retailer','upload date','price']]
df['upload date']=pd.to_datetime(df['upload date'])

Solution:

idx = df.groupby(['uid','retailer'])['upload date'].max().reset_index()
solution = idx.merge(df, how='left', on=['uid','retailer','upload date'])

Question:

I would like to leverage indices to get to my solution. Either I'd like to use join, or I'd like to find the max date of each pairing with a function that retains the indices of the original data frame.

JOIN ERROR:

idx.set_index(['uid','retailer','upload date']).join(df, on=['uid','retailer','upload date'])

Returns:

ValueError: len(left_on) must equal the number of levels in the index of "right"

IIUC, you can use idxmax, which returns the original index label of the row with the latest date in each group:

df.loc[df.groupby(['uid','retailer'])['upload date'].idxmax()]
Out[168]: 
   uid retailer upload date  price
5  123      GAL  2017-11-17  49.98
0  123      GUY  2017-11-17  12.00
1  345      GUY  2017-11-17   1.23
7  369      GAL  2017-11-17   1.02
6  678      GUY  2017-11-17  98.02
8  890      GAL  2017-11-17  98.23
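
As a self-contained illustration of the idxmax approach, here is a minimal sketch on a smaller, made-up frame (the five rows below are an assumption for the example, not the question's full data). It shows that the selected rows keep their original index labels:

```python
import pandas as pd

# Hypothetical sample with the same columns as the question's data.
df = pd.DataFrame({
    "uid": ["123", "123", "345", "345", "123"],
    "retailer": ["GUY", "GUY", "GUY", "GUY", "GAL"],
    "upload date": pd.to_datetime(
        ["11/17/17", "11/16/17", "11/17/17", "11/16/17", "11/17/17"],
        format="%m/%d/%y",
    ),
    "price": [12.00, 69.69, 1.23, 13.72, 49.98],
})

# For each (uid, retailer) group, idxmax gives the index label
# of the row holding the latest upload date.
latest_labels = df.groupby(["uid", "retailer"])["upload date"].idxmax()

# Selecting by label keeps the original index of df.
latest = df.loc[latest_labels]
print(latest)
```

The rows come back in group order (sorted by uid, then retailer), so the index is not monotonic; add .sort_index() if the original row order matters.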

Or pass the same labels to reindex:

df.reindex(df.groupby(['uid','retailer'])['upload date'].idxmax().values)

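If you want to use join, note what the documentation states: "Join columns with other DataFrame either on index or on a key column". The other frame is always matched on its index, so df needs the same three keys in its index: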

idx.set_index(['uid','retailer','upload date']).join(df.set_index(['uid','retailer','upload date']))
Out[175]: 
                          price
uid retailer upload date       
123 GAL      2017-11-17   49.98
    GUY      2017-11-17   12.00
345 GUY      2017-11-17    1.23
369 GAL      2017-11-17    1.02
678 GUY      2017-11-17   98.02
890 GAL      2017-11-17   98.23

To get your expected output, add .reset_index() at the end.
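
Put together, the index-on-index join followed by .reset_index() might look like the sketch below. The small frame and the `keys` list are assumptions made for the example; the groupby/max step mirrors the question's idx.

```python
import pandas as pd

# Hypothetical sample with the same columns as the question's data.
df = pd.DataFrame({
    "uid": ["123", "123", "345", "345", "123"],
    "retailer": ["GUY", "GUY", "GUY", "GUY", "GAL"],
    "upload date": pd.to_datetime(
        ["11/17/17", "11/16/17", "11/17/17", "11/16/17", "11/17/17"],
        format="%m/%d/%y",
    ),
    "price": [12.00, 69.69, 1.23, 13.72, 49.98],
})
keys = ["uid", "retailer", "upload date"]

# Latest date per pairing, as a DataFrame with the three key columns.
idx = df.groupby(["uid", "retailer"])["upload date"].max().reset_index()

# Move the keys into the index on both sides so join aligns index to index.
joined = idx.set_index(keys).join(df.set_index(keys))

# reset_index turns the key levels back into ordinary columns.
solution = joined.reset_index()
print(solution)
```

The original attempt failed because join matches the `on` keys against the index of the other frame, and the unmodified df has a single-level index rather than three levels.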

Or keep the keys as columns in idx and join them against the indexed df:

idx.join(df.set_index(['uid','retailer','upload date']),on=['uid','retailer','upload date'])
Out[177]: 
   uid retailer upload date  price
0  123      GAL  2017-11-17  49.98
1  123      GUY  2017-11-17  12.00
2  345      GUY  2017-11-17   1.23
3  369      GAL  2017-11-17   1.02
4  678      GUY  2017-11-17  98.02
5  890      GAL  2017-11-17  98.23
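
To check that this column-to-index join gives the same rows as the merge from the question, a short comparison could look like this. The sample frame and the names `keys`, `via_join` and `via_merge` are assumptions for the sketch:

```python
import pandas as pd

# Hypothetical sample with the same columns as the question's data.
df = pd.DataFrame({
    "uid": ["123", "123", "345", "345", "123"],
    "retailer": ["GUY", "GUY", "GUY", "GUY", "GAL"],
    "upload date": pd.to_datetime(
        ["11/17/17", "11/16/17", "11/17/17", "11/16/17", "11/17/17"],
        format="%m/%d/%y",
    ),
    "price": [12.00, 69.69, 1.23, 13.72, 49.98],
})
keys = ["uid", "retailer", "upload date"]

# Latest date per pairing, keys kept as ordinary columns.
idx = df.groupby(["uid", "retailer"])["upload date"].max().reset_index()

# join: key columns of idx are matched against the index of the right frame.
via_join = idx.join(df.set_index(keys), on=keys)

# merge: key columns are matched against key columns, no index needed.
via_merge = idx.merge(df, how="left", on=keys)

print(via_join)
print(via_merge)
```

One difference worth knowing: if a pairing has several rows sharing the same latest date, the join and merge variants return all of them, while idxmax returns only the first such row.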
