[英]Group consecutive dates in R
我正在尋找一個 R 包來將連續的日期分組到句點中。 此外,列必須按 FID、PID 和 SETTING 分組:
# Input data
input <- read.csv(text=
"FID,PID,SETTING,DATE
00001, 100001, ST, 2021-01-01
00001, 100001, ST, 2021-01-02
00001, 100001, ST, 2021-01-03
00001, 100002, AB, 2021-01-04
00001, 100001, ST, 2021-01-11
00001, 100001, ST, 2021-01-12
00002, 200001, AB, 2021-01-02
00002, 200001, AB, 2021-01-03
00002, 200001, AB, 2021-01-04
00002, 200002, TK, 2021-01-05"
)
# Expected output
output <- read.csv(text="
FID,PID,SETTING,START,END
00001, 100001, ST, 2021-01-01, 2021-01-03
00001, 100002, AB, 2021-01-04, 2021-01-04
00001, 100001, ST, 2021-01-11, 2021-01-12
00002, 200001, AB, 2021-01-02, 2021-01-04
00002, 200002, TK, 2021-01-05, 2021-01-05"
)
我必須對大約 700'000 行進行分組。 因此,該解決方案應盡可能提高性能。
input <- input[order(input$DATE),]
input$grp <- ave(as.integer(input$DATE), input[-4], FUN = function(z) cumsum(c(TRUE, diff(z) > 1)))
input
# FID PID SETTING DATE grp
# 1 1 100001 ST 2021-01-01 1
# 2 1 100001 ST 2021-01-02 1
# 7 2 200001 AB 2021-01-02 1
# 3 1 100001 ST 2021-01-03 1
# 8 2 200001 AB 2021-01-03 1
# 4 1 100002 AB 2021-01-04 1
# 9 2 200001 AB 2021-01-04 1
# 10 2 200002 TK 2021-01-05 1
# 5 1 100001 ST 2021-01-11 1
# 6 1 100001 ST 2021-01-12 1
out <- aggregate(DATE ~ FID + PID + SETTING + grp, data = input,
FUN = function(z) setNames(range(z), c("START","END")))
out <- do.call(data.frame, out)
out[,5:6] <- lapply(out[,5:6], as.Date, origin = "1970-01-01")
out
# FID PID SETTING grp DATE.START DATE.END
# 1 1 100002 AB 1 2021-01-04 2021-01-04
# 2 2 200001 AB 1 2021-01-02 2021-01-04
# 3 1 100001 ST 1 2021-01-01 2021-01-03
# 4 2 200002 TK 1 2021-01-05 2021-01-05
# 5 1 100001 ST 2 2021-01-11 2021-01-12
演練:
cumsum
和diff
的簡易性是在假設日期總是有序的情況下實現的; 其他分組變量可能被錯誤排序並不重要(這里);ave(..)
分配非連續(差異超過 1)日期的組,我們在下一步中使用;aggregate
計算每個組內的范圍,使用您的三個變量加上我們新的grp
分組變量; 匿名函數中的每個z
都是一個連續的日期向量,因此range
為我們提供了開始/結束日期;do.call(data.frame, out)
解決了這個問題;POSIXt
Date
(和POSIXt
)類,因此我們需要使用as.Date
來解決這個問題。library(dplyr)
input %>%
arrange(DATE) %>%
group_by(FID, PID, SETTING) %>%
mutate(grp = cumsum(c(TRUE, diff(DATE) > 1))) %>%
group_by(FID, PID, SETTING, grp) %>%
summarize(START = min(DATE), END = max(DATE)) %>%
ungroup()
# # A tibble: 5 x 6
# FID PID SETTING grp START END
# <int> <int> <chr> <int> <date> <date>
# 1 1 100001 " ST" 1 2021-01-01 2021-01-03
# 2 1 100001 " ST" 2 2021-01-11 2021-01-12
# 3 1 100002 " AB" 1 2021-01-04 2021-01-04
# 4 2 200001 " AB" 1 2021-01-02 2021-01-04
# 5 2 200002 " TK" 1 2021-01-05 2021-01-05
library(data.table)
inputDT <- as.data.table(input)
setorder(inputDT, DATE)
inputDT[, grp := cumsum(c(TRUE, diff(DATE) > 1)), by = .(FID, PID, SETTING)
][, .(START = min(DATE), END = max(DATE)), by = .(FID, PID, SETTING, grp)
][]
# FID PID SETTING grp START END
# <int> <int> <char> <int> <Date> <Date>
# 1: 1 100001 ST 1 2021-01-01 2021-01-03
# 2: 2 200001 AB 1 2021-01-02 2021-01-04
# 3: 1 100002 AB 1 2021-01-04 2021-01-04
# 4: 2 200002 TK 1 2021-01-05 2021-01-05
# 5: 1 100001 ST 2 2021-01-11 2021-01-12
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.