
PANDAS select rows that contain most recent observation of pairing

The Problem:

I want to select only the most recent price record within each pairing of uid and retailer.

The Data:

import pandas as pd
import numpy as np
data = {"uid":{"0":"123","1":"345","2":"678","3":"123","4":"345","5":"123","6":"678","7":"369","8":"890","9":"678"},"retailer":{"0":"GUY","1":"GUY","2":"GUY","3":"GUY","4":"GUY","5":"GAL","6":"GUY","7":"GAL","8":"GAL","9":"GUY"},"upload date":{"0":"11/17/17","1":"11/17/17","2":"11/16/17","3":"11/16/17","4":"11/16/17","5":"11/17/17","6":"11/17/17","7":"11/17/17","8":"11/17/17","9":"11/15/17"},"price":{"0":12.00,"1":1.23, "2":34.00, "3":69.69, "4":13.72, "5":49.98, "6":98.02, "7":1.02,"8":98.23,"9":12.69}}
df = pd.DataFrame(data=data)
df = df[['uid','retailer','upload date','price']]
df['upload date'] = pd.to_datetime(df['upload date'], format='%m/%d/%y')

Solution:

idx = df.groupby(['uid','retailer'])['upload date'].max().rename('upload date')
idx = idx.reset_index()  # a Series cannot be reset in place into a DataFrame
solution = idx.merge(df, how='left', on=['uid','retailer','upload date'])

Question:

I would like to leverage indices to get to my solution. Either I'd like to be able to use join, or I'd like to find the max date of each pairing with a function that retains the indices of the original data frame.

JOIN ERROR:

idx.set_index(['uid','retailer','upload date']).join(df, on=['uid','retailer','upload date'])

Returns:

ValueError: len(left_on) must equal the number of levels in the index of "right"

IIUC, use idxmax:

df.loc[df.groupby(['uid','retailer'])['upload date'].idxmax()]
Out[168]: 
   uid retailer upload date  price
5  123      GAL  2017-11-17  49.98
0  123      GUY  2017-11-17  12.00
1  345      GUY  2017-11-17   1.23
7  369      GAL  2017-11-17   1.02
6  678      GUY  2017-11-17  98.02
8  890      GAL  2017-11-17  98.23

Or use reindex:

df.reindex(df.groupby(['uid','retailer'])['upload date'].idxmax().values)
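As a minimal, self-contained sketch of why idxmax fits the "retain the original indices" requirement: it returns, for each group, the row label of the latest date, so selecting with .loc keeps those labels. The small frame below is made up for illustration (it is a subset in the same shape as the question's data), and the names latest_labels and latest are just placeholders.

```python
import pandas as pd

# Made-up sample in the same shape as the question's data (assumption).
df = pd.DataFrame({
    "uid": ["123", "345", "123", "345", "123"],
    "retailer": ["GUY", "GUY", "GUY", "GUY", "GAL"],
    "upload date": pd.to_datetime(
        ["11/17/17", "11/17/17", "11/16/17", "11/16/17", "11/17/17"],
        format="%m/%d/%y",
    ),
    "price": [12.00, 1.23, 69.69, 13.72, 49.98],
})

# For each (uid, retailer) pairing, idxmax gives the original row label
# of the row holding the latest upload date.
latest_labels = df.groupby(["uid", "retailer"])["upload date"].idxmax()

# Selecting by label keeps the original index on the result.
latest = df.loc[latest_labels]
print(latest)
```

Note that if a pairing has several rows sharing the same latest date, idxmax picks only the first of them, whereas the merge-based solution would return all of them.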

If you want to use join, the documentation states: "Join columns with other DataFrame either on index or on a key column". So the key columns need to be the index of the frame you join to:

idx.set_index(['uid','retailer','upload date']).join(df.set_index(['uid','retailer','upload date']))
Out[175]: 
                          price
uid retailer upload date       
123 GAL      2017-11-17   49.98
    GUY      2017-11-17   12.00
345 GUY      2017-11-17    1.23
369 GAL      2017-11-17    1.02
678 GUY      2017-11-17   98.02
890 GAL      2017-11-17   98.23

To get your expected output, add .reset_index() at the end.

Or do something like:
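Here is a self-contained sketch of that index-on-index join with .reset_index() appended. The error in the question most likely comes from the fact that join(other, on=...) matches the caller's keys against the index of other, and df still had its default single-level index there, so three keys could not line up. The sample frame and the names keys, joined and result are assumptions for illustration only.

```python
import pandas as pd

keys = ["uid", "retailer", "upload date"]

# Made-up sample in the same shape as the question's data (assumption).
df = pd.DataFrame({
    "uid": ["123", "345", "123", "345", "123"],
    "retailer": ["GUY", "GUY", "GUY", "GUY", "GAL"],
    "upload date": pd.to_datetime(
        ["11/17/17", "11/17/17", "11/16/17", "11/16/17", "11/17/17"],
        format="%m/%d/%y",
    ),
    "price": [12.00, 1.23, 69.69, 13.72, 49.98],
})

# Latest date per pairing, as a regular DataFrame with three columns.
idx = df.groupby(["uid", "retailer"])["upload date"].max().reset_index()

# Both sides are indexed by the same three keys, so join aligns
# the two MultiIndexes level by level.
joined = idx.set_index(keys).join(df.set_index(keys))

# reset_index turns the key levels back into ordinary columns.
result = joined.reset_index()
print(result)
```

The result has a fresh 0..n-1 index, so this route does not preserve the original row labels; use the idxmax approach when those matter.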

idx.join(df.set_index(['uid','retailer','upload date']),on=['uid','retailer','upload date'])
Out[177]: 
   uid retailer upload date  price
0  123      GAL  2017-11-17  49.98
1  123      GUY  2017-11-17  12.00
2  345      GUY  2017-11-17   1.23
3  369      GAL  2017-11-17   1.02
4  678      GUY  2017-11-17  98.02
5  890      GAL  2017-11-17  98.23
