Groupby and get value offset by one year in pandas
My goal today is to follow each ID that belongs to Category==1 on a given date, one year later. So I have a dataframe like this:
Period ID Amount Category
20130101 1 100 1
20130101 2 150 1
20130101 3 100 1
20130201 1 90 1
20130201 2 140 1
20130201 3 95 1
20130201 5 250 0
. . .
20140101 1 40 1
20140101 2 70 1
20140101 5 160 0
20140201 1 35 1
20140201 2 65 1
20140201 5 150 0
For example, in 20130201 I have 3 IDs that belong to Category 1: 1, 2, 3, but just 2 of them are present in 20140201: 1, 2. So I need to get the value of Amount, only for those IDs, one year later, like this:
Period ID Amount Category Amount_t1
20130101 1 100 1 40
20130101 2 150 1 70
20130101 3 100 1 nan
20130201 1 90 1 35
20130201 2 140 1 65
20130201 3 95 1 nan
20130201 5 250 0 nan
. . .
20140101 1 40 1 nan
20140101 2 70 1 nan
20140101 5 160 0 nan
20140201 1 35 1 nan
20140201 2 65 1 nan
20140201 5 150 0 nan
So, if the ID doesn't appear the next year, or it belongs to Category 0, I'll get a nan. My first approach was to get the list of unique IDs in each Period and then try to map that to the next year, using some combination of groupby() and isin(), like this:
aux = df[df.Category==1].groupby('Period').ID.unique()
aux.index = aux.index + pd.DateOffset(years=1)
But I didn't know how to keep going. I'm thinking some kind of groupby('ID') might be more efficient too. If it were a simple shift() that would be easy, but I'm not sure how to get the value offset by a year within each group.
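To illustrate why a row-based shift() is not enough here, the following is a minimal sketch that uses only the ID 1 rows from the table above. The column name next_row is made up for this illustration. shift(-1) looks one row ahead within each ID, regardless of how far apart the dates are, so it returns next month's value rather than next year's.

```python
import pandas as pd

# Only the ID 1 rows from the question's table, with Period already parsed.
df = pd.DataFrame({
    "Period": pd.to_datetime(["2013-01-01", "2013-02-01", "2014-01-01", "2014-02-01"]),
    "ID": [1, 1, 1, 1],
    "Amount": [100, 90, 40, 35],
})

# shift(-1) moves by position inside each ID group, not by calendar time.
# "next_row" is a hypothetical column name used only for this sketch.
df["next_row"] = df.groupby("ID")["Amount"].shift(-1)
print(df)
```

For 2013-01-01 this yields 90 (the February 2013 amount), whereas the desired Amount_t1 is 40 (the January 2014 amount).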
You can create lagged features with an exact merge after you manually lag one of the join keys.
import pandas as pd
# Datetime so we can do calendar year subtraction
df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
# Create one with the lagged features. Here I'll split the steps out.
df2 = df.copy()
df2['Period'] = df2.Period-pd.offsets.DateOffset(years=1) # 1 year lag
df2 = df2.rename(columns={'Amount': 'Amount_t1'})
# Keep only values you want to merge
df2 = df2[df2.Category.eq(1)]
# Bring lagged features
df.merge(df2, on=['Period', 'ID', 'Category'], how='left')
Period ID Amount Category Amount_t1
0 2013-01-01 1 100 1 40.0
1 2013-01-01 2 150 1 70.0
2 2013-01-01 3 100 1 NaN
3 2013-02-01 1 90 1 35.0
4 2013-02-01 2 140 1 65.0
5 2013-02-01 3 95 1 NaN
6 2013-02-01 5 250 0 NaN
7 2014-01-01 1 40 1 NaN
8 2014-01-01 2 70 1 NaN
9 2014-01-01 5 160 0 NaN
10 2014-02-01 1 35 1 NaN
11 2014-02-01 2 65 1 NaN
12 2014-02-01 5 150 0 NaN
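For anyone who wants to run this end to end, here is a self-contained sketch of the same merge idea. It assumes the rows shown in the question (the elided ". . ." rows are left out), and the helper name add_amount_next_year is hypothetical, introduced only to package the steps. Unlike the snippet above, it keeps just the key columns and Amount in the lagged frame, so extra columns would not be duplicated by the merge.

```python
import pandas as pd

# Rows copied from the question; the elided ". . ." rows are omitted.
df = pd.DataFrame({
    "Period": [20130101, 20130101, 20130101, 20130201, 20130201, 20130201, 20130201,
               20140101, 20140101, 20140101, 20140201, 20140201, 20140201],
    "ID": [1, 2, 3, 1, 2, 3, 5, 1, 2, 5, 1, 2, 5],
    "Amount": [100, 150, 100, 90, 140, 95, 250, 40, 70, 160, 35, 65, 150],
    "Category": [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
})
# Parse to datetime so that a calendar-year offset can be applied.
df["Period"] = pd.to_datetime(df["Period"].astype(str), format="%Y%m%d")


def add_amount_next_year(frame):
    """Hypothetical helper: attach next year's Amount for Category 1 rows."""
    # Keep only Category 1 rows and the columns needed for the join.
    lagged = frame.loc[frame["Category"].eq(1),
                       ["Period", "ID", "Category", "Amount"]].copy()
    # Move each row back one year so it lines up with the prior year's row.
    lagged["Period"] = lagged["Period"] - pd.DateOffset(years=1)
    lagged = lagged.rename(columns={"Amount": "Amount_t1"})
    # Left merge keeps every original row; unmatched rows get NaN.
    return frame.merge(lagged, on=["Period", "ID", "Category"], how="left")


result = add_amount_next_year(df)
print(result)
```

Because the merge is a left join on Period, ID and Category, rows with no Category 1 match one year later (ID 3, ID 5, and all 2014 rows) keep NaN in Amount_t1.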