The Problem:
I am looking to select only the most recent record of price within each pairing of uid and retailer.
The Data:
import pandas as pd
import numpy as np
data = {"uid":{"0":"123","1":"345","2":"678","3":"123","4":"345","5":"123","6":"678","7":"369","8":"890","9":"678"},"retailer":{"0":"GUY","1":"GUY","2":"GUY","3":"GUY","4":"GUY","5":"GAL","6":"GUY","7":"GAL","8":"GAL","9":"GUY"},"upload date":{"0":"11/17/17","1":"11/17/17","2":"11/16/17","3":"11/16/17","4":"11/16/17","5":"11/17/17","6":"11/17/17","7":"11/17/17","8":"11/17/17","9":"11/15/17"},"price":{"0":12.00,"1":1.23, "2":34.00, "3":69.69, "4":13.72, "5":49.98, "6":98.02, "7":1.02,"8":98.23,"9":12.69}}
df = pd.DataFrame(data=data)
df = df[['uid','retailer','upload date','price']]
df['upload date'] = pd.to_datetime(df['upload date'], format='%m/%d/%y')
Solution:
idx = df.groupby(['uid','retailer'])['upload date'].max().reset_index()
solution = idx.merge(df, how='left', on=['uid','retailer','upload date'])
Question:
I would like to leverage indices to get to my solution: either by using join, or by finding the max date of each pairing with a function that retains the indices of the original data frame.
JOIN ERROR:
idx.set_index(['uid','retailer','upload date']).join(df, on=['uid','retailer','upload date'])
Returns:
ValueError: len(left_on) must equal the number of levels in the index of "right"
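IIUC, you can use idxmax, which returns the original index label of the row with the latest upload date in each group: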
df.loc[df.groupby(['uid','retailer'])['upload date'].idxmax()]
Out[168]:
   uid retailer upload date  price
5  123      GAL  2017-11-17  49.98
0  123      GUY  2017-11-17  12.00
1  345      GUY  2017-11-17   1.23
7  369      GAL  2017-11-17   1.02
6  678      GUY  2017-11-17  98.02
8  890      GAL  2017-11-17  98.23
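As a minimal, self-contained sketch of the idxmax approach (the four-row frame below is made up for illustration and only borrows the column names from the question), the selected rows keep their original index labels. Note that on ties idxmax returns the first matching label, so only one row per pair survives.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "uid": ["123", "123", "345", "345"],
    "retailer": ["GUY", "GUY", "GUY", "GUY"],
    "upload date": pd.to_datetime(
        ["11/16/17", "11/17/17", "11/17/17", "11/16/17"], format="%m/%d/%y"
    ),
    "price": [69.69, 12.00, 1.23, 13.72],
})

# For each (uid, retailer) pair, get the index label of the latest upload date.
latest_labels = df.groupby(["uid", "retailer"])["upload date"].idxmax()

# .loc with those labels returns the full rows, original index preserved.
latest = df.loc[latest_labels]
print(latest)
```

Here `latest` and `latest_labels` are illustrative names, not part of the original answer.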
Or use reindex with the same labels:
df.reindex(df.groupby(['uid','retailer'])['upload date'].idxmax().values)
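If you want to use join:
The documentation states: "Join columns with other DataFrame either on index or on a key column". Your call failed because df still has its default single-level index while on lists three keys. Setting the same three columns as the index on both sides makes the join work: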
idx.set_index(['uid','retailer','upload date']).join(df.set_index(['uid','retailer','upload date']))
Out[175]:
                          price
uid retailer upload date
123 GAL      2017-11-17   49.98
    GUY      2017-11-17   12.00
345 GUY      2017-11-17    1.23
369 GAL      2017-11-17    1.02
678 GUY      2017-11-17   98.02
890 GAL      2017-11-17   98.23
To get your expected output, add .reset_index() at the end.
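Put together, the index-based join with the trailing .reset_index() might look like the sketch below. The three-row frame is invented for illustration, and `keys` and `joined` are just convenience names. One caveat worth keeping in mind: if two rows share the same uid, retailer and upload date, the join returns both of them.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "uid": ["123", "123", "123"],
    "retailer": ["GUY", "GUY", "GAL"],
    "upload date": pd.to_datetime(
        ["11/16/17", "11/17/17", "11/17/17"], format="%m/%d/%y"
    ),
    "price": [69.69, 12.00, 49.98],
})

keys = ["uid", "retailer", "upload date"]

# Latest upload date per (uid, retailer), as a regular DataFrame.
idx = df.groupby(["uid", "retailer"])["upload date"].max().reset_index()

# Join index-to-index on the three keys, then turn the keys back into columns.
joined = idx.set_index(keys).join(df.set_index(keys)).reset_index()
print(joined)
```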
Or do something like this, keeping idx as a regular frame and passing its key columns through on:
idx.join(df.set_index(['uid','retailer','upload date']),on=['uid','retailer','upload date'])
Out[177]:
   uid retailer upload date  price
0  123      GAL  2017-11-17  49.98
1  123      GUY  2017-11-17  12.00
2  345      GUY  2017-11-17   1.23
3  369      GAL  2017-11-17   1.02
4  678      GUY  2017-11-17  98.02
5  890      GAL  2017-11-17  98.23
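To make the difference concrete, here is a small sketch (made-up data, with `failed` and `result` as illustrative names) showing that the columns named in on are matched against the index of the other frame. With a plain df the call raises a ValueError; once the three keys are moved into df's index, it goes through.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "uid": ["123", "123", "345"],
    "retailer": ["GUY", "GUY", "GUY"],
    "upload date": pd.to_datetime(
        ["11/16/17", "11/17/17", "11/17/17"], format="%m/%d/%y"
    ),
    "price": [69.69, 12.00, 1.23],
})

keys = ["uid", "retailer", "upload date"]
idx = df.groupby(["uid", "retailer"])["upload date"].max().reset_index()

# Plain df has a one-level default index, so three `on` keys cannot match it.
try:
    idx.join(df, on=keys)
    failed = False
except ValueError as exc:
    failed = True
    print("join failed:", exc)

# With the keys set as df's index, the three `on` columns line up with
# the three index levels and the join succeeds.
result = idx.join(df.set_index(keys), on=keys)
print(result)
```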