
Groupby and get value offset by one year in pandas

My goal is to follow each ID that belongs to Category == 1 on a given date and look it up one year later. I have a DataFrame like this:

Period      ID    Amount   Category
20130101    1       100       1
20130101    2       150       1
20130101    3       100       1
20130201    1       90        1
20130201    2       140       1
20130201    3       95        1
20130201    5       250       0
   .        .       .
20140101    1       40        1
20140101    2       70        1
20140101    5       160       0
20140201    1       35        1
20140201    2       65        1
20140201    5       150       0

For example, in 20130201 I have 3 IDs that belong to Category 1 (1, 2, 3), but only 2 of them are present in 20140201 (1, 2). So I need the value of Amount one year later, only for those IDs, like this:

Period      ID    Amount   Category    Amount_t1
20130101    1       100       1           40
20130101    2       150       1           70
20130101    3       100       1           nan
20130201    1       90        1           35
20130201    2       140       1           65
20130201    3       95        1           nan
20130201    5       250       0           nan
   .        .       .
20140101    1       40        1           nan
20140101    2       70        1           nan
20140101    5       160       0           nan
20140201    1       35        1           nan 
20140201    2       65        1           nan
20140201    5       150       0           nan  

So if the ID doesn't appear the following year, or belongs to Category 0, I get a NaN. My first approach was to get the list of unique IDs for each Period and then try to map that to the next year, using some combination of groupby() and isin(), like this:

aux = df[df.Category==1].groupby('Period').ID.unique()
aux.index = aux.index + pd.DateOffset(years=1)

But I didn't know how to continue. I'm thinking some kind of groupby('ID') might be more efficient too. If it were a simple shift() it would be easy, but I'm not sure how to get the value offset by a year within each group.

You can create lagged features with an exact merge after you manually lag one of the join keys.

import pandas as pd

# Datetime so we can do calendar year subtraction
df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')

# Create one with the lagged features. Here I'll split the steps out.
df2 = df.copy()
df2['Period'] = df2.Period-pd.offsets.DateOffset(years=1)  # 1 year lag
df2 = df2.rename(columns={'Amount': 'Amount_t1'})

# Keep only values you want to merge
df2 = df2[df2.Category.eq(1)]

# Bring lagged features
df.merge(df2, on=['Period', 'ID', 'Category'], how='left')

       Period  ID  Amount  Category  Amount_t1
0  2013-01-01   1     100         1       40.0
1  2013-01-01   2     150         1       70.0
2  2013-01-01   3     100         1        NaN
3  2013-02-01   1      90         1       35.0
4  2013-02-01   2     140         1       65.0
5  2013-02-01   3      95         1        NaN
6  2013-02-01   5     250         0        NaN
7  2014-01-01   1      40         1        NaN
8  2014-01-01   2      70         1        NaN
9  2014-01-01   5     160         0        NaN
10 2014-02-01   1      35         1        NaN
11 2014-02-01   2      65         1        NaN
12 2014-02-01   5     150         0        NaN
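
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same lag-then-merge idea. The sample frame is typed in from the rows shown in the question (the elided middle rows are left out), and the helper name add_amount_next_year is hypothetical, used only for illustration. Casting Period to str before parsing is an extra precaution I added, not part of the steps above.

```python
import pandas as pd

# Sample rows copied from the question; the elided rows are omitted.
df = pd.DataFrame({
    "Period": [20130101, 20130101, 20130101,
               20130201, 20130201, 20130201, 20130201,
               20140101, 20140101, 20140101,
               20140201, 20140201, 20140201],
    "ID": [1, 2, 3, 1, 2, 3, 5, 1, 2, 5, 1, 2, 5],
    "Amount": [100, 150, 100, 90, 140, 95, 250, 40, 70, 160, 35, 65, 150],
    "Category": [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
})


def add_amount_next_year(frame):
    """Hypothetical helper: attach the Amount observed one year later as Amount_t1."""
    out = frame.copy()
    # Parse YYYYMMDD into datetimes so calendar-year arithmetic is possible.
    out["Period"] = pd.to_datetime(out["Period"].astype(str), format="%Y%m%d")

    # Keep only Category 1 rows, then move each row's Period back one year
    # so it lines up with the row it should be attached to.
    cols = ["Period", "ID", "Category", "Amount"]
    lagged = out.loc[out["Category"].eq(1), cols].copy()
    lagged["Period"] = lagged["Period"] - pd.DateOffset(years=1)
    lagged = lagged.rename(columns={"Amount": "Amount_t1"})

    # Left merge keeps every original row; rows without a match get NaN.
    return out.merge(lagged, on=["Period", "ID", "Category"], how="left")


result = add_amount_next_year(df)
print(result)
```

Because Category is one of the merge keys, an ID only picks up a value when it is in Category 1 in both years. If an ID could change category between years and you still wanted the later value, you would need to drop Category from the keys, which is a different rule from the one used here.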
