
Pandas rolling window with custom look back length based on column sum

Given a pandas dataframe with two columns, "atbats" and "hits", indexed by date, is it possible to get the most recent historical batting average (average number of hits per atbat)? For example, the historical batting average could be the one computed over the fewest most recent atbats that total more than 10. It is a bit like a rolling window whose number of look-back periods depends on the data. For example, given:

      date, atbats, hits
2017-01-01,      5,    2
2017-01-02,      6,    3
2017-01-03,      1,    1
2017-01-04,     16,    4
2017-01-04,      1,    0

On the first day, there have been no historical atbats. On the second day, only 5 (from the first day). Since both are less than 10, the results can be NaN or just 0.

On the third day, we would look back on the last two days and see 5+6 atbats with an average of (2+3)/(5+6) ≈ 0.45 hits/atbat.

On the first 2017-01-04 row, we would look back on the last three days and get (2+3+1)/(5+6+1) = 0.5 hits/atbat.

On the second 2017-01-04 row, we would look back on just the previous row and get 4/16 = 0.25 hits/atbat. Since that row alone has more than 10 atbats (16), we don't need to look any further.

The final dataframe would look like:

      date, atbats, hits, pastAtbats, pastHits,  avg
2017-01-01,      5,    2,          0,        0,    0
2017-01-02,      6,    3,          0,        0,    0
2017-01-03,      1,    1,         11,        5, 0.45
2017-01-04,     16,    4,         12,        6, 0.50
2017-01-04,      1,    0,         16,        4, 0.25

Is this sort of calculation possible in pandas?

The only solution I can think of is pure brute force: divide hits by atbats in each row, replicate each row x times, where x = atbats, and then apply a rolling window of 10. But in my dataframe "atbats" averages about 80 per day, so this would massively increase the size of the dataframe and the total number of windows to calculate.
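To make the brute-force idea concrete, here is a minimal sketch of it, assuming the five-row sample with 16 atbats and 4 hits on the first 2017-01-04 row. The column name `avg_last10` and the helper variables are illustrative and not part of the question. It averages over exactly the last 10 atbats, so it is an approximation of the whole-row look-back described above rather than the same calculation.

```python
import numpy as np
import pandas as pd

# Sample data (assumed values taken from the example tables).
df = pd.DataFrame({"atbats": [5, 6, 1, 16, 1], "hits": [2, 3, 1, 4, 0]})

# One entry per atbat, each carrying its row's hits-per-atbat rate.
rate = (df["hits"] / df["atbats"]).to_numpy()
per_atbat = pd.Series(np.repeat(rate, df["atbats"].to_numpy()))

# Mean over exactly the last 10 atbats; NaN until 10 atbats exist.
rolled = per_atbat.rolling(10).mean()

# Position of each row's final atbat inside the expanded series.
last_pos = df["atbats"].cumsum().to_numpy() - 1

# Shift by one row so that each row only sees atbats from earlier rows.
df["avg_last10"] = pd.Series(rolled.to_numpy()[last_pos], index=df.index).shift(1)

expanded_len = len(per_atbat)  # one element per atbat: 29 here
```

On this sample the third and fourth rows come out as 0.46 and 0.52 instead of 0.45 and 0.50, because the window cuts through the oldest row instead of including it whole. The expanded series already has 29 entries for 5 rows, which illustrates the size concern raised above.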

Use iterrows to achieve what you need. See below:

Original dataframe:

index atbats  hits
1       5     2
2       6     3
3       1     1
4      16     4
4       1     0
5       1     0
6      14     2
7       5     1

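Code:

import pandas as pd

# Sample data from the table above; note the duplicated index value 4.
df = pd.DataFrame({'atbats': [5, 6, 1, 16, 1, 1, 14, 5],
                   'hits': [2, 3, 1, 4, 0, 0, 2, 1]},
                  index=pd.Index([1, 2, 3, 4, 4, 5, 6, 7], name='index'))

data = []      # [past_atbats, past_hits] for each row
last = [0, 0]  # running [atbats, hits] over the rows seen so far
for i, row in df.iterrows():
    # Record the running totals only once they cover at least 10 atbats.
    if last[0] >= 10:
        data.append(last.copy())
    else:
        data.append([0, 0])

    # A row with 10 or more atbats restarts the totals; otherwise accumulate.
    if row['atbats'] >= 10:
        last[0] = row['atbats']
        last[1] = row['hits']
    else:
        last[0] += row['atbats']
        last[1] += row['hits']

# Assign positionally rather than merging on the index: the index has a
# duplicate (4), so an index merge would pair each of those rows twice.
df['past_atbats'] = [d[0] for d in data]
df['past_hits'] = [d[1] for d in data]
df['avg'] = df['past_hits'].divide(df['past_atbats'])

Result:

index atbats  hits  past_atbats  past_hits       avg
1       5     2            0          0       NaN
2       6     3            0          0       NaN
3       1     1           11          5  0.454545
4      16     4           12          6  0.500000
4       1     0           16          4  0.250000
5       1     0           17          4  0.235294
6      14     2           18          4  0.222222
7       5     1           14          2  0.142857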

There is probably room for optimization, but I think this will help you.
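One possible optimization, offered as a sketch and not as part of the original answer: the look-back can be computed without a Python loop by combining cumulative sums with `numpy.searchsorted`. The function name `lookback_average` and the parameter `min_atbats` are hypothetical. The sketch assumes non-negative atbats, a positive threshold, and the same inclusive "at least 10" rule that the loop uses.

```python
import numpy as np
import pandas as pd

def lookback_average(df, min_atbats=10):
    """For each row, sum the shortest run of preceding rows whose
    atbats total at least min_atbats (hypothetical helper)."""
    atbats = df["atbats"].to_numpy()
    hits = df["hits"].to_numpy()

    # Prefix sums with a leading 0: cum_ab[i] is the total before row i.
    cum_ab = np.concatenate(([0], np.cumsum(atbats)))
    cum_hits = np.concatenate(([0], np.cumsum(hits)))
    before_ab = cum_ab[:-1]
    before_hits = cum_hits[:-1]

    # Latest start j such that rows j..i-1 hold at least min_atbats,
    # i.e. the last j with cum_ab[j] <= before_ab[i] - min_atbats.
    start = np.searchsorted(cum_ab, before_ab - min_atbats, side="right") - 1
    valid = start >= 0               # False while history is too short
    start = np.where(valid, start, 0)

    out = df.copy()
    out["past_atbats"] = np.where(valid, before_ab - cum_ab[start], 0)
    out["past_hits"] = np.where(valid, before_hits - cum_hits[start], 0)
    out["avg"] = out["past_hits"] / out["past_atbats"]  # 0/0 becomes NaN
    return out

df = pd.DataFrame({"atbats": [5, 6, 1, 16, 1, 1, 14, 5],
                   "hits": [2, 3, 1, 4, 0, 0, 2, 1]},
                  index=[1, 2, 3, 4, 4, 5, 6, 7])
result = lookback_average(df)
```

On the sample data this gives the same `past_atbats` and `past_hits` as the loop. The two can differ on other inputs, because the loop only resets its totals when a single row reaches the threshold, while this sketch always picks the shortest qualifying window. For atbats of 5, 6, 5, 6, the fourth row looks back over 6 + 5 = 11 atbats here, whereas the loop would report 16.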
