
Conditional expanding group aggregation pandas

For some data preprocessing I have a huge dataframe where I need historical performance within groups. However, since it is for a predictive model that runs a week before the target, I cannot use any data from that week in between. There is a variable number of rows per day per group, which means I cannot always discard the last 7 values with a shift on the expanding functions; I have to somehow condition on the datetime of the rows before each row. I could write my own function to apply to the groups, but in my experience that is usually very slow (albeit flexible). This is how I did it without conditioning on date, just looking at previous records:

df.loc[:, 'new_col'] = df_gr['old_col'].apply(lambda x: x.expanding(5).mean().shift(1))

The 5 means I want a sample size of at least 5; otherwise the value is set to NaN.
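The shape of that call can be sketched on a toy frame (here `df_gr` is presumably `df.groupby('group')`; `transform` is used instead of `apply` only to write the result back in one step — a sketch, not the original code):

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A'] * 6,
    'old_col': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# expanding(5): NaN until at least 5 samples exist;
# shift(1): exclude the current row from its own aggregate
df['new_col'] = (
    df.groupby('group')['old_col']
      .transform(lambda x: x.expanding(5).mean().shift(1))
)
print(df)
```

Only the sixth row gets a value (the mean of the first five), because the first five rows lack enough history once the current row is excluded.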

Small example, with aggr_mean looking at the mean of all samples within group A that are at least a week earlier:

group | dt       | value  | aggr_mean
A     | 01-01-16 | 5      | NaN
A     | 03-01-16 | 4      | NaN
A     | 08-01-16 | 12     | 5 (only looks at the first row)
A     | 17-01-16 | 11     | 7 (looks at the first three rows, since all are at least a week earlier)

new answer

using @JulienMarrec's better example:

dt           group  value   
2016-01-01     A      5
2016-01-03     A      4
2016-01-08     A     12
2016-01-17     A     11
2016-01-04     B     10
2016-01-05     B      5
2016-01-08     B     12
2016-01-17     B     11

Condition df into a more useful shape:

d1 = df.drop('group', axis=1)
d1.index = [df.group, df.groupby('group').cumcount().rename('gidx')]
d1


Create a custom function that does what the old answer did, then apply it within each groupby group:

def lag_merge_asof(df, lag):
    # expanding mean keyed by date, shifted `lag` days into the future
    d = df.set_index('dt').value.expanding().mean()
    d.index = d.index + pd.offsets.Day(lag)
    d = d.reset_index(name='aggr_mean')
    # for each row, match the latest aggregate dated at or before that row's dt
    return pd.merge_asof(df, d, on='dt')

d1.groupby(level='group').apply(lag_merge_asof, lag=7)


We can get some nicer formatting with this:

d1.groupby(level='group').apply(lag_merge_asof, lag=7) \
    .reset_index('group').reset_index(drop=True)
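Put end to end, the pipeline can be sketched on flat columns, skipping the MultiIndex step (the data and the lag-then-merge_asof logic are from the answer above; column selection just avoids carrying the grouping column into the function):

```python
import pandas as pd

df = pd.DataFrame({
    'dt': pd.to_datetime(['2016-01-01', '2016-01-03', '2016-01-08', '2016-01-17',
                          '2016-01-04', '2016-01-05', '2016-01-08', '2016-01-17']),
    'group': list('AAAABBBB'),
    'value': [5, 4, 12, 11, 10, 5, 12, 11],
})

def lag_merge_asof(g, lag):
    # expanding mean keyed by date, shifted `lag` days into the future
    d = g.set_index('dt').value.expanding().mean()
    d.index = d.index + pd.offsets.Day(lag)
    d = d.reset_index(name='aggr_mean')
    # latest aggregate dated at or before each row's dt
    return pd.merge_asof(g, d, on='dt')

out = (df.groupby('group', group_keys=False)[['dt', 'value']]
         .apply(lag_merge_asof, lag=7)
         .reset_index(drop=True))
print(out)
```

Group B's first aggregate only becomes visible on 2016-01-11 (2016-01-04 shifted by 7 days), so its first three rows stay NaN.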



old answer

Create a lookback dataframe by offsetting the dates by 7 days, then use it with pd.merge_asof:

lookback = df.set_index('dt').value.expanding().mean()

lookback.index += pd.offsets.Day(7)
lookback = lookback.reset_index(name='aggr_mean')

lookback


pd.merge_asof(df, lookback, left_on='dt', right_on='dt')
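As a self-contained sketch on the single-group example from the question (dates rewritten as ISO for construction; this should reproduce the aggr_mean column of the table above):

```python
import pandas as pd

df = pd.DataFrame({
    'dt': pd.to_datetime(['2016-01-01', '2016-01-03', '2016-01-08', '2016-01-17']),
    'value': [5, 4, 12, 11],
})

# expanding mean indexed by date, pushed 7 days into the future so a row
# only "sees" aggregates computed from data at least a week old
lookback = df.set_index('dt').value.expanding().mean()
lookback.index += pd.offsets.Day(7)
lookback = lookback.reset_index(name='aggr_mean')

# for each row, merge_asof takes the latest lookback entry whose (shifted)
# date does not exceed the row's own date
out = pd.merge_asof(df, lookback, on='dt')
print(out)
```

The first two rows find no lookback entry at or before their date, so they get NaN, matching the expected output.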


Given this dataframe, where I added another group in order to see more clearly what's happening:

dt           group  value                               
2016-01-01     A      5
2016-01-03     A      4
2016-01-08     A     12
2016-01-17     A     11
2016-01-04     B     10
2016-01-05     B      5
2016-01-08     B     12
2016-01-17     B     11

Let's load it:

df = pd.read_clipboard(index_col=0, sep='\s+', parse_dates=True)
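read_clipboard requires the table to be on the system clipboard; for a reproducible script, the same frame can be built from a string (an alternative sketch, not part of the original answer):

```python
import io
import pandas as pd

data = """dt           group  value
2016-01-01     A      5
2016-01-03     A      4
2016-01-08     A     12
2016-01-17     A     11
2016-01-04     B     10
2016-01-05     B      5
2016-01-08     B     12
2016-01-17     B     11"""

# same parsing options as read_clipboard: dt becomes a parsed DatetimeIndex
df = pd.read_csv(io.StringIO(data), index_col=0, sep=r'\s+', parse_dates=True)
print(df.dtypes)
```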

Now we can group by group, resample daily, shift by 7 days, and take the expanding mean:

x = df.groupby('group')['value'].apply(lambda gp: gp.resample('1D').mean().shift(7).expanding().mean())

Now you can left-merge that back into your df:

merged = df.reset_index().set_index(['group','dt']).join(x, rsuffix='_aggr_mean', how='left')
merged
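End to end, this resample-based variant can be sketched as follows (note that shift(7) shifts along the daily grid that resample creates, which is exactly what turns it into a 7-day lag; gaps in the grid stay NaN and are skipped by the expanding mean):

```python
import io
import pandas as pd

csv = """dt,group,value
2016-01-01,A,5
2016-01-03,A,4
2016-01-08,A,12
2016-01-17,A,11
2016-01-04,B,10
2016-01-05,B,5
2016-01-08,B,12
2016-01-17,B,11"""
df = pd.read_csv(io.StringIO(csv), index_col=0, parse_dates=True)

# per group: daily grid, values pushed 7 days forward, expanding mean over
# whatever non-NaN history has accumulated so far
x = df.groupby('group')['value'].apply(
    lambda gp: gp.resample('1D').mean().shift(7).expanding().mean())

merged = (df.reset_index()
            .set_index(['group', 'dt'])
            .join(x, rsuffix='_aggr_mean', how='left'))
print(merged)
```

Both groups reproduce the expected behaviour: A's rows see 5 and then 7, while B's first visible aggregate (9) only appears on 2016-01-17.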


Disclaimer: The technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please credit this site or the original source. For any questions, contact yoyou2525@163.com.
