
Calculating rolling average per group in pandas df

I have a df like this:

date        car     model       mpg
1           ford    focus       10
1           ford    fiesta      15
1           ford    mustang     20
2           ford    focus       13
2           ford    fiesta      16
2           ford    mustang     27
3           ford    focus       13
3           ford    mustang     27
4           ford    focus       12
4           ford    fiesta      17

I would like to add a column rolling_avg holding the rolling mean of mpg with window = 2 over date, per group of date, car, model, so that I would have a df like this:

date        car     model       mpg     rolling_avg
1           ford    focus       10      nan
1           ford    fiesta      15      nan
1           ford    mustang     20      nan
2           ford    focus       13      11.5
2           ford    fiesta      16      15.5
2           ford    mustang     27      23.5
3           ford    focus       13      13
3           ford    mustang     27      27
4           ford    focus       12      12.5
4           ford    fiesta      17      8.5

Because fiesta is not in date=3, I want its missing mpg counted as 0 there: (17+0)/2 = 8.5

What I tried:

df.groupby(['date','car','model'])[['mpg']].rolling(window=2).mean().reset_index()
    date    car model   level_3 mpg
0   1   ford    fiesta  1       NaN
1   1   ford    focus   0       NaN
2   1   ford    mustang 2       NaN
3   2   ford    fiesta  4       NaN
4   2   ford    focus   3       NaN
5   2   ford    mustang 5       NaN
6   3   ford    focus   6       NaN
7   3   ford    mustang 7       NaN
8   4   ford    fiesta  9       NaN
9   4   ford    focus   8       NaN

I am not sure what level_3 stands for. Where is my mistake in trying to achieve the structure I desire?

Here is the data used:

import pandas as pd

df = pd.DataFrame({'date':[1,1,1,2,2,2,3,3,4,4],
                   'car':['ford','ford','ford','ford','ford','ford','ford','ford','ford','ford'],
                   'model':['focus','fiesta','mustang','focus','fiesta','mustang','focus','mustang','focus','fiesta'],
                   'mpg':[10,15,20,13,16,27,13,27,12,17]})

What works for me is to first reshape the values with DataFrame.set_index and Series.unstack, filling combinations that have no match with 0, then apply rolling, reshape back with DataFrame.stack, and finally add the new column with DataFrame.join:

s = (df.set_index(['date','car','model'])['mpg']
        .unstack(fill_value=0)
        .rolling(window=2)
        .mean()
        .stack()
        .rename('rolling_avg')
        )

df = df.join(s, on=['date','car','model'])
print (df)
   date   car    model  mpg  rolling_avg
0     1  ford    focus   10          NaN
1     1  ford   fiesta   15          NaN
2     1  ford  mustang   20          NaN
3     2  ford    focus   13         11.5
4     2  ford   fiesta   16         15.5
5     2  ford  mustang   27         23.5
6     3  ford    focus   13         13.0
7     3  ford  mustang   27         27.0
8     4  ford    focus   12         12.5
9     4  ford   fiesta   17          8.5
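
As for why the attempt in the question returned only NaN and a level_3 column, the following is a minimal sketch using the question's data. The names keys, group_sizes and rolled are illustrative and not from the original post. It suggests that grouping by date as well leaves one row per group, so a window of 2 is never filled, and that level_3 is simply the original row index, which has no name.

```python
import pandas as pd

df = pd.DataFrame({
    'date': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'car': ['ford'] * 10,
    'model': ['focus', 'fiesta', 'mustang', 'focus', 'fiesta',
              'mustang', 'focus', 'mustang', 'focus', 'fiesta'],
    'mpg': [10, 15, 20, 13, 16, 27, 13, 27, 12, 17],
})

keys = ['date', 'car', 'model']  # illustrative name, not in the original post

# Each (date, car, model) combination occurs once, so every group has one row.
group_sizes = df.groupby(keys).size()
print(group_sizes.max())

# A window of 2 can never be filled inside a one-row group, hence all NaN.
rolled = df.groupby(keys)[['mpg']].rolling(window=2).mean()
print(rolled['mpg'].isna().all())

# The result index holds the three keys plus the original, unnamed row index.
# reset_index() labels that unnamed fourth level "level_3".
print(list(rolled.index.names))
```

Putting date on the row axis and rolling down each model column, as in the answer above, avoids both effects.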

EDIT: If set_index with unstack fails, the data contains duplicate date, car, model combinations, like:

df = pd.DataFrame({'date':[1,1,1,2,2,2,3,3,4,4],
                   'car':['ford','ford','ford','ford','ford','ford','ford','ford','ford','ford'],
                   'model':['focus','focus','mustang','focus','focus','mustang','focus','mustang','focus','fiesta'],
                   'mpg':[10,15,20,13,16,27,13,27,12,17]})

print (df)
   date   car    model  mpg
0     1  ford    focus   10 <- dupe 1  ford    focus
1     1  ford    focus   15 <- dupe 1  ford    focus
2     1  ford  mustang   20
3     2  ford    focus   13 <- dupe 2  ford    focus
4     2  ford    focus   16 <- dupe 2  ford    focus
5     2  ford  mustang   27
6     3  ford    focus   13
7     3  ford  mustang   27
8     4  ford    focus   12
9     4  ford   fiesta   17
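
To see the failure concretely, here is a small sketch on the duplicated data above. The names keys, n_dupe_rows and unstack_failed are illustrative only. It counts the rows that share a date, car, model combination and shows that unstack refuses to reshape them.

```python
import pandas as pd

df = pd.DataFrame({
    'date': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'car': ['ford'] * 10,
    'model': ['focus', 'focus', 'mustang', 'focus', 'focus',
              'mustang', 'focus', 'mustang', 'focus', 'fiesta'],
    'mpg': [10, 15, 20, 13, 16, 27, 13, 27, 12, 17],
})

keys = ['date', 'car', 'model']  # illustrative name

# Count every row whose key combination appears more than once.
n_dupe_rows = int(df.duplicated(subset=keys, keep=False).sum())
print(n_dupe_rows)

# unstack needs a unique index; with duplicates it raises ValueError.
try:
    df.set_index(keys)['mpg'].unstack(fill_value=0)
    unstack_failed = False
except ValueError as err:
    unstack_failed = True
    print(err)
```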

In that case the combinations first need to be made unique by aggregating, here with sum (or mean, as needed):

df1 = df.pivot_table(index=['date','car'], 
                     columns='model', 
                     values='mpg', 
                     aggfunc='sum', 
                     fill_value=0)
print (df1)
model      fiesta  focus  mustang
date car                         
1    ford       0     25       20
2    ford       0     29       27
3    ford       0     13       27
4    ford      17     12        0

Then rolling can be used. The output has a different shape from the input data, because it contains one row per unique 'date','car','model' combination:

df1 = (df1.rolling(window=2)
        .mean()
        .stack(dropna=False)
        .rename('rolling_avg')
        .reset_index()
        )

print (df1)
  
    date   car    model  rolling_avg
0      1  ford   fiesta          NaN
1      1  ford    focus          NaN
2      1  ford  mustang          NaN
3      2  ford   fiesta          0.0
4      2  ford    focus         27.0
5      2  ford  mustang         23.5
6      3  ford   fiesta          0.0
7      3  ford    focus         21.0
8      3  ford  mustang         27.0
9      4  ford   fiesta          8.5
10     4  ford    focus         12.5
11     4  ford  mustang         13.5
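
If the rolling values are wanted next to the original (duplicated) rows rather than in a separate table, one possible approach is a left merge on the three keys. This is only a sketch under that assumption: the names wide, long_avg and out are hypothetical, and melt is swapped in for stack here simply to get the long format back without depending on stack's dropna argument.

```python
import pandas as pd

df = pd.DataFrame({
    'date': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'car': ['ford'] * 10,
    'model': ['focus', 'focus', 'mustang', 'focus', 'focus',
              'mustang', 'focus', 'mustang', 'focus', 'fiesta'],
    'mpg': [10, 15, 20, 13, 16, 27, 13, 27, 12, 17],
})

# Aggregate duplicates and fill missing combinations with 0, as above.
wide = df.pivot_table(index=['date', 'car'], columns='model',
                      values='mpg', aggfunc='sum', fill_value=0)

# Roll down each model column, then return to one row per combination.
long_avg = (wide.rolling(window=2)
                .mean()
                .reset_index()
                .melt(id_vars=['date', 'car'], var_name='model',
                      value_name='rolling_avg'))

# A left merge keeps every original row, duplicates included.
out = df.merge(long_avg, on=['date', 'car', 'model'], how='left')
print(out)
```

Both duplicated rows of a combination receive the same rolling_avg, because the average is computed on the aggregated value.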
