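Add a new column to a dataframe holding another column's value from the previous month, based on a repeated datetime index, with other columns as identifiers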

Question

If I have this df called feature_df:

Each row represents a particular "cohort" of mortgage loan groups. I want to take the Wac from each row and create a new column called lagged_WAC, filled with the Wac values from the month prior, based on the datetime index called y_m. Additionally, each lagged Wac must correspond to the Vintage and cluster column values for that row. That is why each date is repeated: each row contains data for one mortgage cohort (Vintage, Coupon, and bondsec_code) at that time. The dataset starts at February 2019, though, so there wouldn't be any previous-month values for the rows in that first month. How can I do this?

Here is a more reproducible example with just the index and the Wac column:

              Wac
y_m 
2019-04-01  3.4283
2019-04-01  4.1123
2019-04-01  4.4760
2019-04-01  3.9430
2019-04-01  4.5702
... ...
2022-06-01  2.2441
2022-06-01  4.5625
2022-06-01  5.6446
2022-06-01  4.0584
2022-06-01  3.0412

I have tried this code, which copies the dataframe, lags the dates by a month, and then merges the copy back with the original, but I'm not sure how to check that the Wac_y values in the merged df are correct:

df1 = feature_df.copy().reset_index()
df1['new_date'] = df1['y_m'] + pd.DateOffset(months=-1)
df1 = df1[['Wac', 'new_date']]
feature_df.merge(df1, left_index=True, right_on = 'new_date')

For example, there are values for 2019-01-01, and I don't know where they come from, since the original dataframe doesn't have data for that month. The shape also goes from 20,712 rows to 12,297,442 rows.

Answer 1

I can't test it because I don't have representative data, but from what I see you could try something like this:

df['lagged_WAC'] = df.groupby('cluster', sort=False, as_index=False)['Wac'].shift(1)

If each cluster appears exactly once per month, you can group by cluster and then shift each row within a group by one, so that it picks up the value from the previous row of the same group. If you need to group by more than one column, pass a list to groupby, like df.groupby(['Vintage', 'cluster']).
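The multi-column variant can be sketched as follows. This is a minimal example under stated assumptions: the toy frame and its values are made up, y_m is kept as a regular column (for feature_df that would mean calling reset_index() first), and every (Vintage, cluster) pair is assumed to appear once in every month. The rows are sorted by date first, because shift(1) only means "previous month" when each group is in chronological order.

```python
import pandas as pd

# Hypothetical toy data: two cohorts, identified by (Vintage, cluster),
# observed in three consecutive months.
df = pd.DataFrame({
    "y_m": pd.to_datetime([
        "2019-02-01", "2019-02-01",
        "2019-03-01", "2019-03-01",
        "2019-04-01", "2019-04-01",
    ]),
    "Vintage": [2015, 2016, 2015, 2016, 2015, 2016],
    "cluster": ["a", "a", "a", "a", "a", "a"],
    "Wac": [3.0, 4.0, 3.1, 4.1, 3.2, 4.2],
})

# Sort chronologically so the previous row of a group is the previous month.
df = df.sort_values(["y_m", "Vintage", "cluster"]).reset_index(drop=True)

# Shift Wac by one row within each (Vintage, cluster) group.
df["lagged_WAC"] = df.groupby(["Vintage", "cluster"], sort=False)["Wac"].shift(1)

print(df)
```

The first month of each cohort gets NaN, since there is no earlier row in its group.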

I made a small example dataset to show what I'm thinking of. This is my input:

        Month       Wac cluster
0  2017-04-01  2.271980     car
1  2017-04-01  2.586608     bus
2  2017-04-01  2.071009   plane
3  2017-04-01  2.102676    boat
4  2017-05-01  2.222338     car
5  2017-05-01  2.617924     bus
6  2017-05-01  2.377280   plane
7  2017-05-01  2.150043    boat
8  2017-06-01  2.203132     car
9  2017-06-01  2.072133     bus
10 2017-06-01  2.223499   plane
11 2017-06-01  2.253821    boat
12 2017-07-01  2.228020     car
13 2017-07-01  2.717485     bus
14 2017-07-01  2.446508   plane
15 2017-07-01  2.607244    boat
16 2017-08-01  2.116647     car
17 2017-08-01  2.820238     bus
18 2017-08-01  2.186937   plane
19 2017-08-01  2.827701    boat

df['lagged_WAC'] = df.groupby('cluster', sort=False, as_index=False)['Wac'].shift(1)
print(df)

Output:

        Month       Wac cluster  lagged_WAC
0  2017-04-01  2.271980     car         NaN
1  2017-04-01  2.586608     bus         NaN
2  2017-04-01  2.071009   plane         NaN
3  2017-04-01  2.102676    boat         NaN
4  2017-05-01  2.222338     car    2.271980
5  2017-05-01  2.617924     bus    2.586608
6  2017-05-01  2.377280   plane    2.071009
7  2017-05-01  2.150043    boat    2.102676
8  2017-06-01  2.203132     car    2.222338
9  2017-06-01  2.072133     bus    2.617924
10 2017-06-01  2.223499   plane    2.377280
11 2017-06-01  2.253821    boat    2.150043
12 2017-07-01  2.228020     car    2.203132
13 2017-07-01  2.717485     bus    2.072133
14 2017-07-01  2.446508   plane    2.223499
15 2017-07-01  2.607244    boat    2.253821
16 2017-08-01  2.116647     car    2.228020
17 2017-08-01  2.820238     bus    2.717485
18 2017-08-01  2.186937   plane    2.446508
19 2017-08-01  2.827701    boat    2.607244

The first month has only NaN because there is no earlier month. Each car row in that df now has the value for car from the previous month, each boat row has the value for boat from the previous month, and so on.
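One caveat: shift(1) takes the previous row of the group, which is the previous month only if no group skips a month. If that cannot be guaranteed, a merge on the date plus the identifier columns is an alternative. The sketch below is an illustration under assumptions, not a tested solution for the real data: the toy frame is made up, y_m is a regular column, and (Vintage, cluster) is assumed to identify a cohort. It also differs from the attempt in the question in two ways. The copy's date is moved forward by one month, so that last month's row lines up with the current month, and the identifier columns are part of the merge keys. Merging on the date alone pairs every row of a month with every row of the matching month, which is why the row count grows so much.

```python
import pandas as pd

# Hypothetical toy data: the 2016 cohort has no row for March 2019.
df = pd.DataFrame({
    "y_m": pd.to_datetime([
        "2019-02-01", "2019-02-01",
        "2019-03-01",
        "2019-04-01", "2019-04-01",
    ]),
    "Vintage": [2015, 2016, 2015, 2015, 2016],
    "cluster": ["a", "a", "a", "a", "a"],
    "Wac": [3.0, 4.0, 3.1, 3.2, 4.2],
})
keys = ["Vintage", "cluster"]  # assumed identifier columns

# Copy of the data whose date is pushed one month forward, so a row dated
# February lines up with the March row of the same cohort.
prev = df[["y_m", *keys, "Wac"]].rename(columns={"Wac": "lagged_WAC"})
prev["y_m"] = prev["y_m"] + pd.DateOffset(months=1)

# Left merge keeps every original row; validate raises if the keys on the
# right side are not unique, which would otherwise duplicate rows.
out = df.merge(prev, on=["y_m", *keys], how="left", validate="many_to_one")

print(out)
```

Here the April row of the 2016 cohort gets NaN, because that cohort has no March row, whereas a plain group-wise shift would have filled in the February value. Comparing len(out) with len(df) is a quick check that the merge did not multiply rows.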

Answer 1: accepted, 2022-09-19 20:35:19
