Keep last set of observations within a group with the same (most recent) date
Is there a one-step way to keep only the latest observations within a "group"?
For example, I want to keep only the most recent observations for each PrimaryID-SecondaryID pair.
    PrimaryID  SecondaryID  SubAccount    Value  ReportDate
 0          1            A         123  5618.48  2022-01-01
 1          1            A         456  8206.23  2022-01-01
 2          1            A         123  6722.05  2022-07-01
 3          1            A         456  5500.53  2022-07-01
 4          1            B         789  8990.75  2022-02-01
 5          1            B         987  6294.63  2022-02-01
 6          1            B         789  8389.60  2022-03-01
 7          1            B         246   343.02  2022-03-01
 8          2            X         234  4157.57  2022-02-01
 9          2            X         752  8218.00  2022-02-01
10          2            X         234  6430.68  2022-03-01
11          2            X         755  7148.57  2022-03-01
12          2            Y         731  5406.63  2022-05-02
13          2            Y         480  2429.83  2022-05-02
14          2            Y         731  6251.38  2022-06-01
15          2            Y         841  8256.93  2022-06-01
This is one way to accomplish it, but it seems sloppy:
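# Broadcast each group's latest ReportDate to all of its rows
df['lastRptDt'] = df.groupby(['PrimaryID', 'SecondaryID'])['ReportDate'].transform('max')
# Keep only the rows that fall on that latest date
df1 = df[df['ReportDate'] == df['lastRptDt']]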
This is the desired output:
    PrimaryID  SecondaryID  SubAccount    Value  ReportDate   lastRptDt
 2          1            A         123  6722.05  2022-07-01  2022-07-01
 3          1            A         456  5500.53  2022-07-01  2022-07-01
 6          1            B         789  8389.60  2022-03-01  2022-03-01
 7          1            B         246   343.02  2022-03-01  2022-03-01
10          2            X         234  6430.68  2022-03-01  2022-03-01
11          2            X         755  7148.57  2022-03-01  2022-03-01
14          2            Y         731  6251.38  2022-06-01  2022-06-01
15          2            Y         841  8256.93  2022-06-01  2022-06-01
How about this?
df.set_index(['PrimaryID', 'SecondaryID', 'ReportDate']).loc[:,:,df.groupby(['PrimaryID', 'SecondaryID']).ReportDate.max()]
Out[54]:
                                  SubAccount    Value   lastRptDt
PrimaryID SecondaryID ReportDate
1         A           2022-07-01         123  6722.05  2022-07-01
                      2022-07-01         456  5500.53  2022-07-01
          B           2022-03-01         789  8389.60  2022-03-01
                      2022-03-01         246   343.02  2022-03-01
2         X           2022-03-01         234  6430.68  2022-03-01
                      2022-03-01         755  7148.57  2022-03-01
          Y           2022-06-01         731  6251.38  2022-06-01
                      2022-06-01         841  8256.93  2022-06-01
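Another option, offered as a minimal sketch rather than a definitive answer: the helper column from the question can be skipped by using the per-group maximum directly as a boolean mask. The DataFrame below is rebuilt from the sample data in the question; converting ReportDate with pd.to_datetime is an assumption (the question does not state the column's dtype), and the names keys, latest and result are only illustrative. As far as I can tell, selecting on the date index level with a list of maxima matches any group's latest date rather than each group's own, so a mask computed per group may be the safer choice when dates overlap between groups.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "PrimaryID": [1] * 8 + [2] * 8,
    "SecondaryID": ["A"] * 4 + ["B"] * 4 + ["X"] * 4 + ["Y"] * 4,
    "SubAccount": [123, 456, 123, 456, 789, 987, 789, 246,
                   234, 752, 234, 755, 731, 480, 731, 841],
    "Value": [5618.48, 8206.23, 6722.05, 5500.53, 8990.75, 6294.63,
              8389.60, 343.02, 4157.57, 8218.00, 6430.68, 7148.57,
              5406.63, 2429.83, 6251.38, 8256.93],
    # Assumption: dates are parsed to datetime64 so that "max" compares dates.
    "ReportDate": pd.to_datetime([
        "2022-01-01", "2022-01-01", "2022-07-01", "2022-07-01",
        "2022-02-01", "2022-02-01", "2022-03-01", "2022-03-01",
        "2022-02-01", "2022-02-01", "2022-03-01", "2022-03-01",
        "2022-05-02", "2022-05-02", "2022-06-01", "2022-06-01",
    ]),
})

keys = ["PrimaryID", "SecondaryID"]

# Latest ReportDate of each group, aligned row by row with df.
latest = df.groupby(keys)["ReportDate"].transform("max")

# Keep every row that falls on its own group's latest date (ties are all kept).
result = df[df["ReportDate"] == latest]

print(result)
```

Because the mask is never assigned to df, result has the original five columns and no lastRptDt column.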