I have a data frame with an ID of individual, Date of observation, and Metric measured. It looks like this:
ID Date Metric
a Day 1 9
a Day 2 8
a Day 3 9
a Day 4 8
a Day 5 7
a Day 6 6
a Day 7 5
a Day 8 4
a Day 9 3
a Day 10 3
b Day 1 6
b Day 2 7
b Day 3 6
b Day 4 7
b Day 5 8
b Day 6 9
b Day 7 9
b Day 8 9
I would like to condense this into a single row on the ID and add variables (prev2, prev3, prev4, prev5) that calculate the average of n observations since the latest observation date (but not including the lastdate in the average). For example - "prev2" is the average of the 2 most recent observations and "prev3" is the average of the 3 most recent observations. So prev2 for ID A is the average of day 8 and day 9 (3.5). prev3 for ID B is the average of day 5, 6, 7 (8.67). Ultimately looking back from the most recent/largest date and average a series of observations.
It should look something like this:
ID lastDate metric_avg prev2 prev3 prev4 prev5
a Day 10 6.2 3.5 4 4.5 5
b Day 8 7.63 9 8.67 8.25 7.8
I am trying to create predictor variables to analyze attrition at my company. The thought being that 1 month or 2 months from leaving the job, Johnny's performance metrics change in such a way that could predict if Jimmy is going to attrite in the near future.
Any recommendations or ideas how to analyze this data would be super sweet!
Thanks!
I would go about it using dplyr
, tidyr
and magrittr
.
Data
df <-
data.frame(ID=c(rep("a", 10), rep("b", 8), rep("c", 3), "d"),
Date=c(paste("Day", 1:10), paste("Day", 1:8), paste("Day", 11:13), "Day 8"),
Metric=c(9, 8, 9, 8, 7, 6, 5, 4, 3, 3, 6, 7, 6, 7, 8, 9, 9, 9, 3, 1, 8, 10))
Code
library(tidyr); library(dplyr); library(magrittr)
df %<>% separate(Date, into=c("d1", "d2")) %>%
arrange(ID, as.numeric(d2)) %>%
group_by(ID) %>%
mutate(last_Date=paste("Day", max(as.numeric(d2))),
metric_Avg=mean(Metric),
prev2=(lag(Metric)+lag(Metric, 2))/2,
prev3=(lag(Metric)+lag(Metric, 2)+lag(Metric, 3))/3,
prev4=(lag(Metric)+lag(Metric, 2)+lag(Metric, 3)+lag(Metric, 4))/4,
prev5=(lag(Metric)+lag(Metric, 2)+lag(Metric, 3)+lag(Metric, 4)+lag(Metric, 5))/5) %>%
ungroup %>%
filter(last_Date==paste(d1, d2)) %>%
select(ID, last_Date, metric_Avg, prev2, prev3, prev4, prev5)
df
Output
ID last_Date metric_Avg prev2 prev3 prev4 prev5
1 a Day 10 6.200 3.5 4.000 4.50 5.0
2 b Day 8 7.625 9.0 8.667 8.25 7.8
3 c Day 13 4.000 2.0 NA NA NA
4 d Day 8 10.000 NA NA NA NA
Remark
If your Date
column has dates then use the lubridate
package. the first few lines of the code would be:
df$Date <- ymd(df$Date) # id the Date is of the form yyyy-mm-dd or yyyy/mm/dd
df %<>% arrange(ID, Date) %>% group_by(ID) %>% mutate(last_Date= max(Date)...
"lapply" might be useful:
ID <- unique(data$ID)
rowNr <- lapply(ID,function(id){which(data$ID==id)})
lastDate <- lapply(rowNr,function(n){data$Date[rev(n)[1]]})
metricAvg <- lapply(rowNr,function(n){mean(data$Metric[n])})
prev2 <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),3),2)])})
prev3 <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),4),3)])})
prev4 <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),5),4)])})
prev5 <- lapply(rowNr,function(n){mean(data$Metric[head(tail(c(NA,n),6),5)])})
output <- data.frame( ID = ID,
last_Date = unlist(lastDate),
metric_Avg = unlist(metricAvg),
prev2 = unlist(prev2),
prev3 = unlist(prev3),
prev4 = unlist(prev4),
prev5 = unlist(prev5) )
Output:
> output
ID last_Date metric_Avg prev2 prev3 prev4 prev5
1 a Day 10 6.200 3.5 4.000000 4.50 5.0
2 b Day 8 7.625 9.0 8.666667 8.25 7.8
Another example, that shows what happens if there are not enough days to calculate "prev5", "prev4",...:
> data
ID Date Metric
1 a Day 1 9
2 a Day 2 8
3 a Day 3 9
4 a Day 4 8
5 a Day 5 7
6 a Day 6 6
7 a Day 7 5
8 a Day 8 4
9 a Day 9 3
10 a Day 10 3
11 b Day 1 6
12 b Day 2 7
13 b Day 3 6
14 b Day 4 7
15 b Day 5 8
16 b Day 6 9
17 b Day 7 9
18 b Day 8 9
19 c Day 11 3
20 c Day 12 1
21 c Day 13 8
22 d Day 8 10
Output:
> output
ID last_Date metric_Avg prev2 prev3 prev4 prev5
1 a Day 10 6.200 3.5 4.000000 4.50 5.0
2 b Day 8 7.625 9.0 8.666667 8.25 7.8
3 c Day 13 4.000 2.0 NA NA NA
4 d Day 8 10.000 NA NA NA NA
>
This lightweight base R solution is even faster than its overcharged competitor:
> system.time(
+ for ( i in 1:5000)
+ {
+ ID <- unique(data$ID)
+ .... [TRUNCATED]
user system elapsed
28.28 0.01 28.47
> #-----------------------------------------------------------------
>
> library(tidyr); library(dplyr); library(magrittr)
> system.time(
+ for ( i in 1:5000)
+ {
+ df <-data
+
+ df %<>% separate(Date, into=c("d1", "d2")) %>%
+ arrange(ID, as.numeri .... [TRUNCATED]
user system elapsed
46.56 0.05 46.87
>
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.