Group by - select most recent 4 events

Question

I have the following df in pandas:

df:
DATE    STOCK   DATA1   DATA2   DATA3
01/01/12    ABC 0.40    0.88    0.22
04/01/12    ABC 0.50    0.49    0.13
07/01/12    ABC 0.85    0.36    0.83
10/01/12    ABC 0.28    0.12    0.39
01/01/13    ABC 0.86    0.87    0.58
04/01/13    ABC 0.95    0.39    0.87
07/01/13    ABC 0.60    0.25    0.56
10/01/13    ABC 0.15    0.28    0.69
01/01/11    XYZ 0.94    0.40    0.50
04/01/11    XYZ 0.65    0.19    0.81
07/01/11    XYZ 0.89    0.59    0.69
10/01/11    XYZ 0.12    0.09    0.18
01/01/12    XYZ 0.25    0.94    0.55
04/01/12    XYZ 0.07    0.22    0.67
07/01/12    XYZ 0.46    0.08    0.54
10/01/12    XYZ 0.04    0.03    0.94
...

I want to group by stock, sort by date, and then, for specified columns (in this case DATA1 and DATA3), get the sum of the last four items (TTM data).

The output would look like this:

DATE    STOCK   DATA1   DATA2   DATA3   DATA1_TTM   DATA3_TTM
01/01/12    ABC 0.40    0.88    0.22    NaN         NaN
04/01/12    ABC 0.50    0.49    0.13    NaN         NaN
07/01/12    ABC 0.85    0.36    0.83    NaN         NaN
10/01/12    ABC 0.28    0.12    0.39    2.03        1.56
01/01/13    ABC 0.86    0.87    0.58    2.49        1.92
04/01/13    ABC 0.95    0.39    0.87    2.94        2.66
07/01/13    ABC 0.60    0.25    0.56    2.69        2.39
10/01/13    ABC 0.15    0.28    0.69    2.55        2.70
01/01/11    XYZ 0.94    0.40    0.50    NaN         NaN
04/01/11    XYZ 0.65    0.19    0.81    NaN         NaN
07/01/11    XYZ 0.89    0.59    0.69    NaN         NaN
10/01/11    XYZ 0.12    0.09    0.18    2.59        2.18
01/01/12    XYZ 0.25    0.94    0.55    1.90        2.23
04/01/12    XYZ 0.07    0.22    0.67    1.33        2.09
07/01/12    XYZ 0.46    0.08    0.54    0.89        1.94
10/01/12    XYZ 0.04    0.03    0.94    0.82        2.70
...

My approach so far has been to sort by date, then group, then iterate through each group and, if there are three events older than the current event, sum the four. I also want to check that the dates fall within 1 year. Can anyone offer a better way in Python? Thank you.

Added: As a clarification for the 1 year part, let's say you take the last four dates and it goes 1/1/1993, 4/1/12, 7/1/12, 10/1/12 -- a data error. I wouldn't want to sum those four. I would want that one to say NaN.

Answer 1

For this I think you can use transform and rolling_sum. Starting from your dataframe, I might do something like:

>>> df["DATE"] = pd.to_datetime(df["DATE"]) # switch to datetime to ease sorting
>>> df = df.sort(["STOCK", "DATE"])
>>> rsum_columns = "DATA1", "DATA3"
>>> grouped = df.groupby("STOCK")[rsum_columns]
>>> new_columns = grouped.transform(lambda x: pd.rolling_sum(x, 4))
>>> df[new_columns.columns + "_TTM"] = new_columns
>>> df
                  DATE STOCK  DATA1  DATA2  DATA3  DATA1_TTM  DATA3_TTM
0  2012-01-01 00:00:00   ABC   0.40   0.88   0.22        NaN        NaN
1  2012-04-01 00:00:00   ABC   0.50   0.49   0.13        NaN        NaN
2  2012-07-01 00:00:00   ABC   0.85   0.36   0.83        NaN        NaN
3  2012-10-01 00:00:00   ABC   0.28   0.12   0.39       2.03       1.57
4  2013-01-01 00:00:00   ABC   0.86   0.87   0.58       2.49       1.93
5  2013-04-01 00:00:00   ABC   0.95   0.39   0.87       2.94       2.67
6  2013-07-01 00:00:00   ABC   0.60   0.25   0.56       2.69       2.40
7  2013-10-01 00:00:00   ABC   0.15   0.28   0.69       2.56       2.70
8  2011-01-01 00:00:00   XYZ   0.94   0.40   0.50        NaN        NaN
9  2011-04-01 00:00:00   XYZ   0.65   0.19   0.81        NaN        NaN
10 2011-07-01 00:00:00   XYZ   0.89   0.59   0.69        NaN        NaN
11 2011-10-01 00:00:00   XYZ   0.12   0.09   0.18       2.60       2.18
12 2012-01-01 00:00:00   XYZ   0.25   0.94   0.55       1.91       2.23
13 2012-04-01 00:00:00   XYZ   0.07   0.22   0.67       1.33       2.09
14 2012-07-01 00:00:00   XYZ   0.46   0.08   0.54       0.90       1.94
15 2012-10-01 00:00:00   XYZ   0.04   0.03   0.94       0.82       2.70

[16 rows x 7 columns]
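
Later pandas releases removed `DataFrame.sort` and `pd.rolling_sum`, so the session above will not run as written on a current install. Here is a rough sketch of the same idea using `sort_values` and `.rolling(4).sum()`. The small frame is just a subset of the question's rows (DATA2 left out for brevity), and the `%m/%d/%y` date format is my assumption about how the dates are written:

```python
import pandas as pd

# Subset of the question's data; DATA2 is omitted to keep the example short.
df = pd.DataFrame({
    "DATE": ["01/01/12", "04/01/12", "07/01/12", "10/01/12", "01/01/13",
             "01/01/11", "04/01/11", "07/01/11", "10/01/11"],
    "STOCK": ["ABC"] * 5 + ["XYZ"] * 4,
    "DATA1": [0.40, 0.50, 0.85, 0.28, 0.86, 0.94, 0.65, 0.89, 0.12],
    "DATA3": [0.22, 0.13, 0.83, 0.39, 0.58, 0.50, 0.81, 0.69, 0.18],
})

# Assumed format: month/day/two-digit year.
df["DATE"] = pd.to_datetime(df["DATE"], format="%m/%d/%y")
df = df.sort_values(["STOCK", "DATE"]).reset_index(drop=True)

rsum_columns = ["DATA1", "DATA3"]

# Rolling 4-row sum computed separately inside each stock; the first three
# rows of every group have no full window and come out as NaN.
ttm = df.groupby("STOCK")[rsum_columns].transform(lambda s: s.rolling(4).sum())

# Attach the results as DATA1_TTM / DATA3_TTM.
df = df.join(ttm.add_suffix("_TTM"))
print(df)
```

Because `transform` returns a frame aligned with the original index, the join lines up row for row without any extra bookkeeping.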

I don't know what you're asking by "Also, I want to check to see if the dates fall within 1 year", so I'll leave that alone.
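
Going by the clarification added to the question, one possible reading is: blank out the TTM value whenever the four dates in the window span more than a year. A minimal sketch of that reading follows. The 366-day cutoff and the four-digit-year date format are my own assumptions, not something the question specifies:

```python
import pandas as pd

# Hypothetical data with one bad date (1993) at the start of the group.
df = pd.DataFrame({
    "DATE": ["01/01/1993", "04/01/2012", "07/01/2012", "10/01/2012", "01/01/2013"],
    "STOCK": ["ABC"] * 5,
    "DATA1": [0.40, 0.50, 0.85, 0.28, 0.86],
})
df["DATE"] = pd.to_datetime(df["DATE"], format="%m/%d/%Y")
df = df.sort_values(["STOCK", "DATE"]).reset_index(drop=True)

# Rolling 4-row sum within each stock.
df["DATA1_TTM"] = df.groupby("STOCK")["DATA1"].transform(
    lambda s: s.rolling(4).sum()
)

# Date of the oldest row in each 4-row window (three rows back, same stock).
oldest_in_window = df.groupby("STOCK")["DATE"].shift(3)

# Assumed rule: the window is valid only if it spans at most 366 days.
# Rows without a full window compare against NaT and evaluate to False.
span_ok = (df["DATE"] - oldest_in_window) <= pd.Timedelta(days=366)

df.loc[~span_ok, "DATA1_TTM"] = float("nan")
print(df)
```

Here the window ending on 2012-10-01 reaches back to 1993 and is set to NaN, while the window ending on 2013-01-01 covers only the four 2012-2013 dates and keeps its sum.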
