
Rolling mean with varying window length in Python

I am working with NLSY79 data and am trying to construct a 'smoothed' income variable that averages income over a period of 4 years. Between 1979 and 1994, the NLSY conducted surveys annually, while from 1996 onward the survey was conducted biennially. This means that my smoothed income variable will average four observations prior to 1994 and only two after 1996.

I would like my smoothed income variable to satisfy the following criteria:

1) It should be an average of 4 income observations from 1979 to 1994 and only 2 from 1996 onward.

2) The window should START from a given observation rather than be centered on it. Therefore, my smoothed income variable should tell me the average income over the four years starting from that date.

3) It should ignore NaNs.

It should, therefore, look like the following (note that I only computed the values of 'smoothed income' that can be computed from the data I have provided):

id  year  income  'smoothed income'
1   1979  20,000  21,250
1   1980  22,000
1   1981  21,000
1   1982  22,000
...
1   2014  34,000  34,500
1   2016  35,000
2   1979  28,000  28,333
2   1980  NaN
2   1981  28,000
2   1982  29,000

I am relatively new to dataframe manipulation with pandas, so here is what I have tried:

smooth = DATA.groupby('id')['income'].rolling(window=4, min_periods=1).mean()
DATA['smoothIncome'] =  smooth.reset_index(level=0, drop=True)

This code accounts for NaNs (criterion 3), but does not accomplish objectives 1) and 2).
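For context, `rolling` in pandas is trailing by default: each value is averaged with the rows before it, not the rows after it. The sketch below contrasts that with a forward-looking window built with `FixedForwardWindowIndexer` from `pandas.api.indexers` (available in recent pandas versions). The small `income` series is taken from the rows for id 1 in the table above. Note that this indexer counts rows, not calendar years, so on its own it only illustrates criterion 2 and does not handle the annual-to-biennial switch.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# Income for id 1, 1979-1982, copied from the question's table.
income = pd.Series(
    [20000.0, 22000.0, 21000.0, 22000.0],
    index=[1979, 1980, 1981, 1982],
)

# Default behaviour: the window ends at the current row (looks backward).
trailing = income.rolling(window=4, min_periods=1).mean()

# Forward-looking: the window starts at the current row and covers up to
# the next three rows; min_periods=1 lets it shrink at the end of the series.
forward = income.rolling(
    window=FixedForwardWindowIndexer(window_size=4), min_periods=1
).mean()

print(pd.DataFrame({"income": income, "trailing": trailing, "forward": forward}))
```

With the trailing window, the 1979 value is just the 1979 income; with the forward window it is the mean of 1979 through 1982.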

Any help would be much appreciated.

Use:

df.set_index('year').groupby('id').income.apply(lambda x: x.reindex(range(x.index.min(),x.index.max()+1))
                                                           .ffill().rolling(4).mean().shift(-3)).reset_index() 

OK, I've modified the code provided by ansev to make it work. Filling in NaNs was causing the problems.

Here's the modified code:

df.set_index('year').groupby('id').income.apply(lambda x: x.reindex(range(x.index.min(),x.index.max()+1))
                                                           .rolling(4, min_periods = 1).mean().shift(-3)).reset_index()
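To see why the forward fill mattered, here is a small comparison of the two variants on the rows for id 2 from the question (the variable names are illustrative only). With `ffill()`, the missing 1980 value is replaced by the 1979 value and enters the average; with `min_periods=1` and no fill, the NaN is simply skipped.

```python
import numpy as np
import pandas as pd

# Income for id 2, 1979-1982, copied from the question's table.
s = pd.Series([28000.0, np.nan, 28000.0, 29000.0], index=[1979, 1980, 1981, 1982])

# Original answer: forward-fill, then a strict 4-row window shifted back by 3.
with_ffill = s.ffill().rolling(4).mean().shift(-3)

# Modified version: no fill, NaNs are ignored inside the window.
without_fill = s.rolling(4, min_periods=1).mean().shift(-3)

print(pd.DataFrame({"with_ffill": with_ffill, "without_fill": without_fill}))
```

For 1979 the first variant gives 28,250 (the filled value counts as an observation), while the second gives about 28,333, matching the expected output in the question. In both variants `shift(-3)` leaves the last three rows of each series as NaN.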

The only problem I have now is that the mean is not calculated when there are fewer than 4 years remaining (e.g. from 2014 onward, because my data runs until 2016). Is there a way of shortening the window length after 2014?
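One possible way to let the window shrink near the end of the panel, offered as a sketch rather than as the approach from the answer above: reverse each respondent's series so that an ordinary trailing `rolling(4, min_periods=1)` looks forward in time, then reverse it back. No `shift` is needed, so the last rows are not lost. The helper name `forward_mean`, the column name `smoothed_income`, and the toy frame (built only from the rows shown in the question) are illustrative assumptions, and the code assumes each id has at most one row per year.

```python
import numpy as np
import pandas as pd


def forward_mean(income_by_year: pd.Series, span: int = 4) -> pd.Series:
    """Mean of income over calendar years [t, t + span - 1], skipping NaNs.

    Hypothetical helper: expects a Series indexed by unique integer years.
    """
    years = income_by_year.index
    # Insert the unsurveyed years as NaN so that one row equals one year.
    full = income_by_year.reindex(range(years.min(), years.max() + 1))
    # Reverse, apply a trailing window, reverse back: the window now starts
    # at year t and runs forward; min_periods=1 lets it shrink at the end.
    fwd = full.iloc[::-1].rolling(span, min_periods=1).mean().iloc[::-1]
    # Keep only the years that were actually surveyed.
    return fwd.reindex(years)


# Toy data: only the rows that appear in the question's table.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "year": [1979, 1980, 1981, 1982, 2014, 2016, 1979, 1980, 1981, 1982],
    "income": [20000, 22000, 21000, 22000, 34000, 35000,
               28000, np.nan, 28000, 29000],
})

parts = []
for _, group in df.groupby("id"):
    smoothed = forward_mean(group.set_index("year")["income"])
    # Re-attach the original row labels so the result aligns with df.
    parts.append(pd.Series(smoothed.to_numpy(), index=group.index))

df["smoothed_income"] = pd.concat(parts)
print(df)
```

Because the window is defined in calendar years, it covers four observations while the survey is annual and two once it is biennial. Windows that straddle the switch (for example one starting in 1993) would contain three observations, and the final survey year is averaged over itself only; whether that is acceptable depends on the analysis.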
