I use R 3.5.0 on Windows 10.
I have a data frame which is imported using library(openxls) and read.xls("...."). It has 100,000 rows, and part of it looks like this:
S.No Start.Date End.Date Generation unitout timediff
7850 42907.76 42907.77 436.158469 INSERVICE 15
7851 42907.77 42907.78 443.302793 INSERVICE 15
7852 42907.78 42907.79 437.728683 INSERVICE 15
7853 42907.79 42907.80 390.832887 INSERVICE 15
7854 42907.80 42907.81 338.917658 INSERVICE 15
7855 42907.81 42907.82 300.056018 INSERVICE 15
7856 42907.82 42907.83 266.430064 INSERVICE 15
7857 42907.83 42907.84 248.952525 INSERVICE 15
7858 42907.84 42907.85 212.913333 INSERVICE 15
7859 42907.85 42907.86 18.523060 INSERVICE 15
7860 42907.86 42907.88 1.355428 OUTOFSERVICE 15
7861 42907.88 42907.89 1.355428 OUTOFSERVICE 15
7862 42907.89 42907.90 1.355428 OUTOFSERVICE 15
7863 42907.90 42907.91 1.355428 OUTOFSERVICE 15
7864 42907.91 42907.92 1.355428 OUTOFSERVICE 15
7865 42907.92 42907.93 1.355428 OUTOFSERVICE 15
7866 42907.93 42907.94 1.355428 OUTOFSERVICE 15
7867 42907.94 42907.95 1.355428 OUTOFSERVICE 15
7868 42907.95 42907.96 1.355428 OUTOFSERVICE 15
7869 42907.96 42907.97 1.355428 OUTOFSERVICE 15
7870 42907.97 42907.98 1.355428 OUTOFSERVICE 15
I would like to summarise this into a data frame of the form:
1 DateTime1(42907.76) DateTime2(42907.86) INSERVICE TIMEDIFF
2 DateTime2(42907.86) DateTime3(42907.98) OUTOFSERVICE TIMEDIFF
3 DateTime3(42907.98) DateTime4(...) INSERVICE TIMEDIFF
where a new row starts every time the status changes between INSERVICE and OUTOFSERVICE, capturing the start date and end date of each period. Basically, I want to know from which date and time to which date and time the unit was in service or out of service, summarised in a data frame. In the example above, DateTime1 would be 42907.76 and DateTime2 would be 42907.86, since the unit goes out of service after that. Similarly, the next period would run from DateTime2 (42907.86) to DateTime3 (42907.98), and so on.
I tried creating a flag to solve this, but I wasn't able to build the data frame, so I have not attached the code here. I would prefer an easy-to-understand solution with clear logic over packages that do everything behind the scenes.
PS: An additional problem is converting the Excel time format to a standard %Y%m%d%H%M format. I have read multiple threads on SO and tried as.POSIXct, as.Date, etc., but the result either becomes NA or throws an error.
Using dplyr:
We compare unitout with its lagged value to detect each status change, then take the cumulative sum of those changes to build a run ID that we can group by afterwards.
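library(dplyr)
# New id whenever unitout differs from the previous row; default = "" makes the first row start run 1
status <- as.character(df$unitout)
df$id <- cumsum(status != lag(status, default = ""))
# One row per run: earliest start, latest end
df %>% group_by(id, unitout) %>% summarise(Start = min(Start.Date), End = max(End.Date))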
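Since the question asks for logic that is easy to follow without extra packages, the same run-detection idea can also be sketched in base R with rle(). This is only a sketch: sample_df is a small stand-in built from a few of the question's rows, and summing timediff per run is my assumption about what the TIMEDIFF column should hold.

```r
# Stand-in data: a few rows copied from the question's sample.
sample_df <- data.frame(
  Start.Date = c(42907.76, 42907.77, 42907.85, 42907.86, 42907.88, 42907.97),
  End.Date   = c(42907.77, 42907.78, 42907.86, 42907.88, 42907.89, 42907.98),
  unitout    = c("INSERVICE", "INSERVICE", "INSERVICE",
                 "OUTOFSERVICE", "OUTOFSERVICE", "OUTOFSERVICE"),
  timediff   = 15,
  stringsAsFactors = FALSE
)

# rle() finds runs of identical consecutive values.
runs <- rle(as.character(sample_df$unitout))
last_row  <- cumsum(runs$lengths)          # index of the last row of each run
first_row <- last_row - runs$lengths + 1   # index of the first row of each run

# Label every row with the number of the run it belongs to.
run_id <- rep(seq_along(runs$lengths), runs$lengths)

summary_df <- data.frame(
  Start    = sample_df$Start.Date[first_row],
  End      = sample_df$End.Date[last_row],
  unitout  = runs$values,
  # Assumption: the wanted TIMEDIFF is the total of timediff within the run.
  timediff = as.vector(tapply(sample_df$timediff, run_id, sum)),
  stringsAsFactors = FALSE
)

print(summary_df)
```

Each run contributes exactly one row, so a status that flips back and forth many times produces one row per period rather than one row per status.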
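Excel stores dates as day counts, so you can convert them with as.Date() and the matching origin. Use the first form for the default 1900 date system (Excel on Windows) and the second for workbooks that use the 1904 date system (older Excel for Mac). Note that as.Date() keeps only the date part:
as.Date(42907.76, origin = "1899-12-30")
as.Date(42907.76, origin = "1904-01-01")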
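If you also need the time of day, one possible approach is to convert the day count to seconds and go through as.POSIXct(). The sketch below assumes the 1900 date system and treats the values as UTC, since Excel serial numbers carry no time zone; excel_to_datetime is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: Excel serial day number -> POSIXct date-time.
# Assumes the workbook uses the 1900 date system (origin 1899-12-30).
excel_to_datetime <- function(x) {
  # 86400 seconds per day; round to whole seconds so floating-point
  # error in the fractional day cannot push the time down by a second.
  as.POSIXct(round(x * 86400), origin = "1899-12-30", tz = "UTC")
}

dt <- excel_to_datetime(c(42907.76, 42907.86))
formatted <- format(dt, "%Y-%m-%d %H:%M")
print(formatted)
# "2017-06-21 18:14" "2017-06-21 20:38"
```

Applied to the summarised Start and End columns, this gives readable date-times for each in-service and out-of-service period.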
Data:
library(readr)
df <- read_table(
"S.No Start.Date End.Date Generation unitout timediff
7850 42907.76 42907.77 436.158469 INSERVICE 15
7851 42907.77 42907.78 443.302793 INSERVICE 15
7852 42907.78 42907.79 437.728683 INSERVICE 15
7853 42907.79 42907.80 390.832887 INSERVICE 15
7854 42907.80 42907.81 338.917658 INSERVICE 15
7855 42907.81 42907.82 300.056018 INSERVICE 15
7856 42907.82 42907.83 266.430064 INSERVICE 15
7857 42907.83 42907.84 248.952525 INSERVICE 15
7858 42907.84 42907.85 212.913333 INSERVICE 15
7859 42907.85 42907.86 18.523060 INSERVICE 15
7860 42907.86 42907.88 1.355428 OUTOFSERVICE 15
7861 42907.88 42907.89 1.355428 OUTOFSERVICE 15
7862 42907.89 42907.90 1.355428 OUTOFSERVICE 15
7863 42907.90 42907.91 1.355428 OUTOFSERVICE 15
7864 42907.91 42907.92 1.355428 OUTOFSERVICE 15
7865 42907.92 42907.93 1.355428 OUTOFSERVICE 15
7866 42907.93 42907.94 1.355428 OUTOFSERVICE 15
7867 42907.94 42907.95 1.355428 OUTOFSERVICE 15
7868 42907.95 42907.96 1.355428 OUTOFSERVICE 15
7869 42907.96 42907.97 1.355428 OUTOFSERVICE 15
7870 42907.97 42907.98 1.355428 OUTOFSERVICE 15")