Calculating the mean using logical condition

I have a football dataset for a season, and some of the variables are: player_id, week and points (a grade for each player in a match).

So, each player_id appears several times in my dataset.

My goal is to calculate the average points for each player, but only over the previous weeks.

For example, for the row where player_id=5445 and week=10, I want the mean over the rows where player_id=5445 and week is from 1 to 9.

I know I can do it by filtering the data for each row and calculating the mean. But I hope to do it in a smarter/faster way...

I thought of something like:

aggregate(mydata$points, FUN=mean, 
          by=list(player_id=mydata$player_id, week<mydata$week))

but it did not work.

Thanks!

Here's a solution, along with some sample data:

football_df <- 
  data.frame(player_id = c(1, 2, 3, 4),
             points = as.integer(runif(40, 0, 10)), 
             week = rep(1:10, each = 4))

Getting a running average:

library(dplyr)
football_df %>%
  group_by(player_id) %>%            # the group to perform the stat on
  arrange(week) %>%                  # order the rows by week
  mutate(avg = cummean(points)) %>%  # cumulative mean up to and including each week
  mutate(avg = lag(avg)) %>%         # shift down one row so each week sees only earlier weeks
  arrange(player_id)                 # sort by player_id

Here are the first two players of the resulting table. For player 1, the average over the previous weeks is 7 in week 2, and (7 + 9) / 2 = 8 in week 3:

   player_id points week      avg
1          1      7    1       NA
2          1      9    2 7.000000
3          1      9    3 8.000000
4          1      1    4 8.333333
5          1      4    5 6.500000
6          1      8    6 6.000000
7          1      0    7 6.333333
8          1      2    8 5.428571
9          1      5    9 5.000000
10         1      8   10 5.000000
11         2      6    1       NA
12         2      9    2 6.000000
13         2      5    3 7.500000
14         2      1    4 6.666667
15         2      0    5 5.250000
16         2      9    6 4.200000
17         2      8    7 5.000000
18         2      6    8 5.428571
19         2      6    9 5.500000
20         2      8   10 5.555556
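
For readers who prefer not to depend on dplyr, the same lagged cumulative mean can be sketched in base R. This is only an illustration: the helper name prev_mean is hypothetical (it is not part of the answer above), and the small data frame simply reuses the first three weeks of players 1 and 2 from the table.

```r
# First three weeks of players 1 and 2, copied from the table above
df <- data.frame(
  player_id = c(1, 1, 1, 2, 2, 2),
  week      = c(1, 2, 3, 1, 2, 3),
  points    = c(7, 9, 9, 6, 9, 5)
)

# Make sure the weeks are in order within each player
df <- df[order(df$player_id, df$week), ]

# Hypothetical helper: mean of all earlier values, NA for the first one
prev_mean <- function(x) {
  n <- length(x)
  c(NA, cumsum(x)[-n] / seq_len(n - 1))
}

# ave() applies the helper separately within each player_id
df$avg <- ave(df$points, df$player_id, FUN = prev_mean)
df
```

The avg column should match the first three rows of each player in the table: NA, 7, 8 for player 1 and NA, 6, 7.5 for player 2.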

I will use your data, but with a call to set.seed to make the results reproducible. Then I will call aggregate with the formula interface. Note that I've changed the name of the variable week to last_week so that it can be used in subset.

set.seed(2550)    # make the results reproducible

player_id <- c(3242,56546,76575,4234,654654,6564,43242,42344,4342,6776,5432,8796,54767)
week <- 1:30
points <- rnorm(390)
mydata <- data.frame(player_id = rep(player_id, 30), 
                     week = rep(week,13),points)

last_week <- 10
agg <- aggregate(points ~ player_id + week, data = subset(mydata, week < last_week), mean)
head(agg)
#  player_id week     points
#1      3242    1 -1.3281831
#2      4234    1  0.3578657
#3      4342    1 -0.8267423
#4      5432    1 -0.4245487
#5      6564    1 -0.2968879
#6      6776    1  0.8348178
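
If what is wanted is a single mean per player over all weeks before last_week, one possible variation is to drop week from the right-hand side of the formula. The sketch below uses a small made-up data frame (the values are assumptions chosen so the result is easy to check by hand), not the simulated data above.

```r
# Made-up data: two players, four weeks each
mydata <- data.frame(
  player_id = rep(c(10, 20), each = 4),
  week      = rep(1:4, times = 2),
  points    = c(1, 2, 3, 100, 4, 6, 8, 100)
)

last_week <- 4

# Group by player_id only, so the weeks before last_week are pooled per player
agg <- aggregate(points ~ player_id,
                 data = subset(mydata, week < last_week),
                 FUN = mean)
agg
```

Here player 10 gets (1 + 2 + 3) / 3 = 2 and player 20 gets (4 + 6 + 8) / 3 = 6; the week 4 values are excluded by the subset.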
