I'm looking for some help writing more efficient code. I have the following data set.
Report| ReportPeriod|ObsDate
1 | 15 |2017-12-31 00:00:00
1 | 15 |2017-12-31 06:00:00
1 | 15 |2017-12-31 12:30:00
2 | 11 |2018-01-01 07:00:00
2 | 11 |2018-01-01 13:00:00
2 | 11 |2018-01-01 16:30:00
The first column, "Report", is a unique identifier for a particular report; this data set contains only two reports (1 and 2). The second column, "ReportPeriod", is the same for every row of a report: Report 1 is 15 hours long and Report 2 is 11 hours long. The third column, "ObsDate", holds the individual observation times within a report.
Problem: I need to find the time difference between consecutive observations, grouped by "Report". I did that with the following code.
library(dplyr)

example <- data.frame(Report = c(1, 1, 1, 2, 2, 2), ReportPeriod = c(15, 15, 15, 11, 11, 11),
                      ObsDate = c(as.POSIXct("2017-12-31 00:00:00"), as.POSIXct("2017-12-31 06:00:00"),
                                  as.POSIXct("2017-12-31 12:30:00"), as.POSIXct("2018-01-01 07:00:00"),
                                  as.POSIXct("2018-01-01 13:00:00"), as.POSIXct("2018-01-01 16:30:00")))
# Time since the previous observation within each report (NA for the first row)
example <- example %>%
  group_by(Report) %>%
  mutate(DiffPeriod = ObsDate - lag(ObsDate))
The output is:
Report| ReportPeriod|ObsDate |DiffPeriod
1 | 15 |2017-12-31 00:00:00|NA
1 | 15 |2017-12-31 06:00:00|6.0
1 | 15 |2017-12-31 12:30:00|6.5
2 | 11 |2018-01-01 07:00:00|NA
2 | 11 |2018-01-01 13:00:00|6.0
2 | 11 |2018-01-01 16:30:00|3.5
Now the first entry of each report is NA. Each of these values should instead be the report's total period ("ReportPeriod") minus the sum of its DiffPeriod values; for Report 1 that is 15 - (6.0 + 6.5) = 2.5 hours.
I did that using the following code.
xyz <- data.frame()
for (i in unique(example$Report)) {
  df <- example %>% filter(Report == i)
  hrs <- sum(df$DiffPeriod, na.rm = TRUE)
  tot <- df$ReportPeriod[1]
  bal <- tot - hrs
  df$DiffPeriod[1] <- bal
  xyz <- xyz %>% bind_rows(df)
}
The final output is :
Report| ReportPeriod|ObsDate |DiffPeriod
1 | 15 |2017-12-31 00:00:00|2.5
1 | 15 |2017-12-31 06:00:00|6.0
1 | 15 |2017-12-31 12:30:00|6.5
2 | 11 |2018-01-01 07:00:00|1.5
2 | 11 |2018-01-01 13:00:00|6.0
2 | 11 |2018-01-01 16:30:00|3.5
Is there a better/more efficient way to do what I did in the for-loop above in the tidyverse?
Thanks.
Assuming ReportPeriod is always in hours, we can first compute the difference between ObsDate and lag(ObsDate) in hours. The only NA in each group is then the first row, which we replace with the first value of ReportPeriod minus the sum of DiffPeriod for that group (Report).
library(dplyr)

example %>%
  group_by(Report) %>%
  mutate(DiffPeriod = difftime(ObsDate, lag(ObsDate), units = "hours"),
         DiffPeriod = replace(DiffPeriod, is.na(DiffPeriod),
                              ReportPeriod[1] - sum(DiffPeriod, na.rm = TRUE)))

#  Report ReportPeriod ObsDate             DiffPeriod
#   <dbl>        <dbl> <dttm>              <time>
#1      1           15 2017-12-31 00:00:00 2.5 hours
#2      1           15 2017-12-31 06:00:00 6.0 hours
#3      1           15 2017-12-31 12:30:00 6.5 hours
#4      2           11 2018-01-01 07:00:00 1.5 hours
#5      2           11 2018-01-01 13:00:00 6.0 hours
#6      2           11 2018-01-01 16:30:00 3.5 hours
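
If a plain numeric column is preferable to a difftime one, a variation along the following lines might work. This is only a sketch under a few assumptions that are not in the question: the data is rebuilt here with tz = "UTC" so the example is self-contained, the result is stored in a new object called result, and the first row of each group is picked with row_number() == 1 rather than is.na(), which also assumes the rows are already ordered by ObsDate within each report.

library(dplyr)

# Rebuild the example data from the question; tz = "UTC" is an assumption
# added here so the result does not depend on the local time zone.
example <- data.frame(
  Report       = c(1, 1, 1, 2, 2, 2),
  ReportPeriod = c(15, 15, 15, 11, 11, 11),
  ObsDate      = as.POSIXct(c("2017-12-31 00:00:00", "2017-12-31 06:00:00",
                              "2017-12-31 12:30:00", "2018-01-01 07:00:00",
                              "2018-01-01 13:00:00", "2018-01-01 16:30:00"),
                            tz = "UTC")
)

result <- example %>%
  group_by(Report) %>%
  mutate(
    # Gap to the previous observation as plain numeric hours (NA on the first row)
    DiffPeriod = as.numeric(difftime(ObsDate, lag(ObsDate), units = "hours")),
    # The first row of each report gets whatever is left of the report period
    DiffPeriod = if_else(row_number() == 1,
                         first(ReportPeriod) - sum(DiffPeriod, na.rm = TRUE),
                         DiffPeriod)
  ) %>%
  ungroup()

Keeping DiffPeriod numeric avoids mixing a difftime with the numeric ReportPeriod, and the final ungroup() returns an ordinary tibble so later steps are not silently grouped by Report.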