Group by - select most recent 4 events

I have the following DataFrame in pandas:

df:
DATE    STOCK   DATA1   DATA2   DATA3
01/01/12    ABC 0.40    0.88    0.22
04/01/12    ABC 0.50    0.49    0.13
07/01/12    ABC 0.85    0.36    0.83
10/01/12    ABC 0.28    0.12    0.39
01/01/13    ABC 0.86    0.87    0.58
04/01/13    ABC 0.95    0.39    0.87
07/01/13    ABC 0.60    0.25    0.56
10/01/13    ABC 0.15    0.28    0.69
01/01/11    XYZ 0.94    0.40    0.50
04/01/11    XYZ 0.65    0.19    0.81
07/01/11    XYZ 0.89    0.59    0.69
10/01/11    XYZ 0.12    0.09    0.18
01/01/12    XYZ 0.25    0.94    0.55
04/01/12    XYZ 0.07    0.22    0.67
07/01/12    XYZ 0.46    0.08    0.54
10/01/12    XYZ 0.04    0.03    0.94
...

I want to group by stock, sort by date, and then, for specified columns (in this case DATA1 and DATA3), sum the last four items (TTM data).

The output would look like this:

DATE    STOCK   DATA1   DATA2   DATA3   DATA1_TTM   DATA3_TTM
01/01/12    ABC 0.40    0.88    0.22    NaN         NaN
04/01/12    ABC 0.50    0.49    0.13    NaN         NaN
07/01/12    ABC 0.85    0.36    0.83    NaN         NaN
10/01/12    ABC 0.28    0.12    0.39    2.03        1.56
01/01/13    ABC 0.86    0.87    0.58    2.49        1.92
04/01/13    ABC 0.95    0.39    0.87    2.94        2.66
07/01/13    ABC 0.60    0.25    0.56    2.69        2.39
10/01/13    ABC 0.15    0.28    0.69    2.55        2.70
01/01/11    XYZ 0.94    0.40    0.50    NaN         NaN
04/01/11    XYZ 0.65    0.19    0.81    NaN         NaN
07/01/11    XYZ 0.89    0.59    0.69    NaN         NaN
10/01/11    XYZ 0.12    0.09    0.18    2.59        2.18
01/01/12    XYZ 0.25    0.94    0.55    1.90        2.23
04/01/12    XYZ 0.07    0.22    0.67    1.33        2.09
07/01/12    XYZ 0.46    0.08    0.54    0.89        1.94
10/01/12    XYZ 0.04    0.03    0.94    0.82        2.70
...

My approach so far has been to sort by date, then group, then iterate through each group and, if there are 3 events older than the current event, sum the four. Also, I want to check that the dates fall within 1 year. Can anyone offer a better way in Python? Thank you.

Added: As a clarification for the 1 year part, let's say you take the last four dates and they are 1/1/1993, 4/1/12, 7/1/12, 10/1/12 (a data error). I wouldn't want to sum those four. I would want that row to say NaN.

For this I think you can use transform with a rolling sum. Starting from your dataframe, I might do something like:

>>> import pandas as pd
>>> df["DATE"] = pd.to_datetime(df["DATE"])  # switch to datetime to ease sorting
>>> df = df.sort_values(["STOCK", "DATE"])  # df.sort(...) in older pandas
>>> rsum_columns = ["DATA1", "DATA3"]  # a list, since groupby column selection needs one
>>> grouped = df.groupby("STOCK")[rsum_columns]
>>> # 4-row rolling sum within each stock; pd.rolling_sum(x, 4) in older pandas
>>> new_columns = grouped.transform(lambda x: x.rolling(4).sum())
>>> df[new_columns.columns + "_TTM"] = new_columns
>>> df
                  DATE STOCK  DATA1  DATA2  DATA3  DATA1_TTM  DATA3_TTM
0  2012-01-01 00:00:00   ABC   0.40   0.88   0.22        NaN        NaN
1  2012-04-01 00:00:00   ABC   0.50   0.49   0.13        NaN        NaN
2  2012-07-01 00:00:00   ABC   0.85   0.36   0.83        NaN        NaN
3  2012-10-01 00:00:00   ABC   0.28   0.12   0.39       2.03       1.57
4  2013-01-01 00:00:00   ABC   0.86   0.87   0.58       2.49       1.93
5  2013-04-01 00:00:00   ABC   0.95   0.39   0.87       2.94       2.67
6  2013-07-01 00:00:00   ABC   0.60   0.25   0.56       2.69       2.40
7  2013-10-01 00:00:00   ABC   0.15   0.28   0.69       2.56       2.70
8  2011-01-01 00:00:00   XYZ   0.94   0.40   0.50        NaN        NaN
9  2011-04-01 00:00:00   XYZ   0.65   0.19   0.81        NaN        NaN
10 2011-07-01 00:00:00   XYZ   0.89   0.59   0.69        NaN        NaN
11 2011-10-01 00:00:00   XYZ   0.12   0.09   0.18       2.60       2.18
12 2012-01-01 00:00:00   XYZ   0.25   0.94   0.55       1.91       2.23
13 2012-04-01 00:00:00   XYZ   0.07   0.22   0.67       1.33       2.09
14 2012-07-01 00:00:00   XYZ   0.46   0.08   0.54       0.90       1.94
15 2012-10-01 00:00:00   XYZ   0.04   0.03   0.94       0.82       2.70

[16 rows x 7 columns]
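
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It assumes the sample rows from the question are pasted in as whitespace-separated text and that the dates are month/day/two-digit-year. It also swaps transform for a grouped rolling() call, which is an alternative spelling of the same idea rather than the exact code above; the names raw, cols and ttm are only illustrative.

```python
import io

import pandas as pd

# Sample rows copied from the question (whitespace separated).
raw = """DATE STOCK DATA1 DATA2 DATA3
01/01/12 ABC 0.40 0.88 0.22
04/01/12 ABC 0.50 0.49 0.13
07/01/12 ABC 0.85 0.36 0.83
10/01/12 ABC 0.28 0.12 0.39
01/01/13 ABC 0.86 0.87 0.58
04/01/13 ABC 0.95 0.39 0.87
07/01/13 ABC 0.60 0.25 0.56
10/01/13 ABC 0.15 0.28 0.69
01/01/11 XYZ 0.94 0.40 0.50
04/01/11 XYZ 0.65 0.19 0.81
07/01/11 XYZ 0.89 0.59 0.69
10/01/11 XYZ 0.12 0.09 0.18
01/01/12 XYZ 0.25 0.94 0.55
04/01/12 XYZ 0.07 0.22 0.67
07/01/12 XYZ 0.46 0.08 0.54
10/01/12 XYZ 0.04 0.03 0.94
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
# Assumption: dates are month/day/two-digit-year.
df["DATE"] = pd.to_datetime(df["DATE"], format="%m/%d/%y")
df = df.sort_values(["STOCK", "DATE"]).reset_index(drop=True)

cols = ["DATA1", "DATA3"]
ttm = (
    df.groupby("STOCK")[cols]
    .rolling(4)                         # 4-row window within each stock
    .sum()                              # NaN until 4 rows are available
    .reset_index(level=0, drop=True)    # drop the STOCK level to match df's index
    .add_suffix("_TTM")
)
df = df.join(ttm)                       # align on the row index
print(df)
```

The first three rows of each stock come out as NaN because rolling(4) requires four observations by default.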

I don't know what you're asking by "Also, I want to check to see if the dates fall within 1 year", so I'll leave that alone.
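
Going by the clarification added to the question (a window such as 1/1/1993, 4/1/12, 7/1/12, 10/1/12 should give NaN), one possible reading is that the four rows in each window must span no more than a year. A minimal sketch under that assumption follows; the 365-day threshold, the toy data, and the names window_start and within_year are illustrative choices, not part of the original question or answer.

```python
import pandas as pd

# Toy data with one bad date (1993) at the start, as in the question's example.
df = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["1993-01-01", "2012-04-01", "2012-07-01", "2012-10-01", "2013-01-01"]
    ),
    "STOCK": ["ABC"] * 5,
    "DATA1": [0.40, 0.50, 0.85, 0.28, 0.86],
})
df = df.sort_values(["STOCK", "DATE"]).reset_index(drop=True)

# Plain 4-row rolling sum per stock.
df["DATA1_TTM"] = df.groupby("STOCK")["DATA1"].transform(
    lambda s: s.rolling(4).sum()
)

# Date of the oldest row in each 4-row window (NaT if fewer than 4 rows).
window_start = df.groupby("STOCK")["DATE"].shift(3)

# Assumption: "within 1 year" means the window spans at most 365 days.
# Comparisons against NaT evaluate to False, so short windows are masked too.
within_year = (df["DATE"] - window_start) <= pd.Timedelta(days=365)

# Blank out sums whose window is too wide.
df.loc[~within_year, "DATA1_TTM"] = float("nan")
print(df)
```

Here the window ending 2012-10-01 reaches back to 1993 and is set to NaN, while the window ending 2013-01-01 starts at 2012-04-01 and keeps its sum.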
