简体   繁体   English

R - 最近观察日期的n个观测值的平均值

[英]R - Average of n observations from the most recent observed date

I have a data frame with an ID of individual, Date of observation, and Metric measured. 我有一个数据框,其中包含个人ID,观察日期和度量标准。 It looks like this: 它看起来像这样:

ID  Date    Metric
a   Day 1     9
a   Day 2     8
a   Day 3     9
a   Day 4     8
a   Day 5     7
a   Day 6     6
a   Day 7     5
a   Day 8     4
a   Day 9     3
a   Day 10    3
b   Day 1     6
b   Day 2     7
b   Day 3     6
b   Day 4     7
b   Day 5     8
b   Day 6     9
b   Day 7     9
b   Day 8     9

I would like to condense this into a single row on the ID and add variables (prev2, prev3, prev4, prev5) that calculate the average of n observations since the latest observation date (but not including the lastdate in the average). 我想将其浓缩为ID中的单行并添加变量(prev2,prev3,prev4,prev5),这些变量计算自最近观察日期以来n次观测的平均值(但不包括平均值中的最后日期)。 For example - "prev2" is the average of the 2 most recent observations and "prev3" is the average of the 3 most recent observations. 例如 - “prev2”是最近2次观测的平均值,“prev3”是3次最近观测值的平均值。 So prev2 for ID A is the average of day 8 and day 9 (3.5). 因此ID A的prev2是第8天和第9天的平均值(3.5)。 prev3 for ID B is the average of day 5, 6, 7 (8.67). ID B的prev3是第5,6,7天(8.67)的平均值。 Ultimately looking back from the most recent/largest date and average a series of observations. 最终回顾最近/最大的日期并平均一系列观察。

It should look something like this: 它应该看起来像这样:

ID  lastDate    metric_avg  prev2   prev3   prev4   prev5   
a   Day 10      6.2          3.5     4       4.5      5 
b   Day 8       7.63           9    8.67    8.25    7.8 

I am trying to create predictor variables to analyze attrition at my company. 我正在尝试创建预测变量来分析我公司的损耗。 The thought being that 1 month or 2 months from leaving the job, Johnny's performance metrics change in such a way that could predict if Jimmy is going to attrite in the near future. 考虑到离开工作1个月或2个月后,约翰尼的绩效指标发生了变化,可以预测吉米是否会在不久的将来闯入。

Any recommendations or ideas how to analyze this data would be super sweet! 如何分析这些数据的任何建议或想法都会超级甜蜜!

Thanks! 谢谢!

I would go about it using dplyr , tidyr and magrittr . 我会用dplyrtidyrmagrittr

Data 数据

df <- 
    data.frame(ID=c(rep("a", 10), rep("b", 8), rep("c", 3), "d"),
               Date=c(paste("Day", 1:10), paste("Day", 1:8), paste("Day", 11:13), "Day 8"),
               Metric=c(9, 8, 9, 8, 7, 6, 5, 4, 3, 3, 6, 7, 6, 7, 8, 9, 9, 9, 3, 1, 8, 10))

Code

library(tidyr); library(dplyr); library(magrittr)

df %<>% separate(Date, into=c("d1", "d2")) %>% 
        arrange(ID, as.numeric(d2)) %>% 
        group_by(ID) %>% 
        mutate(last_Date=paste("Day", max(as.numeric(d2))), 
               metric_Avg=mean(Metric), 
               prev2=(lag(Metric)+lag(Metric, 2))/2,
               prev3=(lag(Metric)+lag(Metric, 2)+lag(Metric, 3))/3,
               prev4=(lag(Metric)+lag(Metric, 2)+lag(Metric, 3)+lag(Metric, 4))/4,
               prev5=(lag(Metric)+lag(Metric, 2)+lag(Metric, 3)+lag(Metric, 4)+lag(Metric, 5))/5) %>% 
       ungroup %>% 
       filter(last_Date==paste(d1, d2)) %>% 
       select(ID, last_Date, metric_Avg, prev2, prev3, prev4, prev5)

df

Output 产量

  ID     last_Date   metric_Avg   prev2    prev3    prev4   prev5
1  a     Day 10      6.200        3.5      4.000    4.50    5.0
2  b     Day 8       7.625        9.0      8.667    8.25    7.8
3  c     Day 13      4.000        2.0      NA       NA      NA
4  d     Day 8       10.000       NA       NA       NA      NA

Remark 备注

If your Date column has dates then use the lubridate package. 如果您的Date列有日期,那么使用lubridate包。 the first few lines of the code would be: 代码的前几行是:

df$Date <- ymd(df$Date) # id the Date is of the form yyyy-mm-dd or yyyy/mm/dd

df %<>% arrange(ID, Date) %>% group_by(ID) %>% mutate(last_Date= max(Date)...

"lapply" might be useful: “lapply”可能有用:

ID <- unique(data$ID)

rowNr <- lapply(ID,function(id){which(data$ID==id)})

lastDate  <- lapply(rowNr,function(n){data$Date[rev(n)[1]]})
metricAvg <- lapply(rowNr,function(n){mean(data$Metric[n])})
prev2     <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),3),2)])})
prev3     <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),4),3)])})
prev4     <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),5),4)])})
prev5     <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),6),5)])})



output <- data.frame( ID         = ID,
                      last_Date  = unlist(lastDate),
                      metric_Avg = unlist(metricAvg),
                      prev2      = unlist(prev2),
                      prev3      = unlist(prev3),
                      prev4      = unlist(prev4),
                      prev5      = unlist(prev5)  )

Output: 输出:

> output
  ID last_Date metric_Avg prev2    prev3 prev4 prev5
1  a    Day 10      6.200   3.5 4.000000  4.50   5.0
2  b     Day 8      7.625   9.0 8.666667  8.25   7.8

Another example, that shows what happens if there are not enough days to calculate "prev5", "prev4",...: 另一个例子,显示如果没有足够的天数来计算“prev5”,“prev4”,...:

> data
   ID   Date Metric
1   a  Day 1      9
2   a  Day 2      8
3   a  Day 3      9
4   a  Day 4      8
5   a  Day 5      7
6   a  Day 6      6
7   a  Day 7      5
8   a  Day 8      4
9   a  Day 9      3
10  a Day 10      3
11  b  Day 1      6
12  b  Day 2      7
13  b  Day 3      6
14  b  Day 4      7
15  b  Day 5      8
16  b  Day 6      9
17  b  Day 7      9
18  b  Day 8      9
19  c Day 11      3
20  c Day 12      1
21  c Day 13      8
22  d  Day 8     10

Output: 输出:

> output
  ID last_Date metric_Avg prev2    prev3 prev4 prev5
1  a    Day 10      6.200   3.5 4.000000  4.50   5.0
2  b     Day 8      7.625   9.0 8.666667  8.25   7.8
3  c    Day 13      4.000   2.0       NA    NA    NA
4  d     Day 8     10.000    NA       NA    NA    NA
> 

This lightweight base R solution is even faster than its overcharged competitor: 这种轻量级基础R解决方案甚至比其过度充电的竞争对手更快:

> system.time(
+   for ( i in 1:5000)
+   {
+     ID <- unique(data$ID)
+      .... [TRUNCATED] 
   user  system elapsed 
  28.28    0.01   28.47 

> #-----------------------------------------------------------------
> 
> library(tidyr); library(dplyr); library(magrittr)

> system.time(
+   for ( i in 1:5000)
+   {
+     df <-data
+     
+     df %<>% separate(Date, into=c("d1", "d2")) %>% 
+       arrange(ID, as.numeri .... [TRUNCATED] 
   user  system elapsed 
  46.56    0.05   46.87 
> 

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM