R: Converting consecutive dates from a single column into a 2-column range
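I'm trying to figure out how to combine rows that have a single date column so that the new table/data frame/tibble has two columns: one for the start date and one for the end date, but only for consecutive dates (i.e. any gap in the dates should start a new row in the new table). The result also needs to be grouped by several categories.
An example of the type of data I'm working with is as follows: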
Person ID Department Date
351581 JE 12/1/2019
351581 JE 12/2/2019
351581 FR 12/2/2019
351581 JE 12/3/2019
598168 GH 12/16/2019
351581 JE 12/8/2019
351581 JE 12/9/2019
615418 AB 12/20/2019
615418 AB 12/22/2019
The desired result is:
Person ID Department Start Date End Date
351581 JE 12/1/2019 12/3/2019
351581 FR 12/2/2019 12/2/2019
598168 GH 12/16/2019 12/16/2019
351581 JE 12/8/2019 12/9/2019
615418 AB 12/20/2019 12/20/2019
615418 AB 12/22/2019 12/22/2019
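My searching so far has turned up a few possibly related questions that deal with combining date ranges, but I'm not sure how to apply them to a single column of dates: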
dplyr
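Adding this for the benefit of future readers: I ended up applying the accepted solution with dplyr, simply because I'm more familiar with the syntax.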
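library(dplyr)

df %>%
  # assumes Date is stored as month/day/year text, as in the sample above
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  arrange(`Person ID`, Department, Date) %>%
  # start a new run g whenever the step from the previous date is not exactly one day
  group_by(`Person ID`, Department,
           g = cumsum(c(0, diff(Date)) != 1)) %>%
  summarize(Start = min(Date), End = max(Date)) %>%
  ungroup() %>%
  select(-g)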
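We assume here that what is being asked is that, within each consecutive run of Person_ID and Department, we want the minimum and maximum date.
1) data.table First convert the Date column to Date class, then group by rleid(Person_ID, Department) and take the minimum and maximum.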
library(data.table)
library(lubridate)
DT <- as.data.table(DF0)
DT[, Date := mdy(Date)][
, list(start = min(Date), end = max(Date)),
by = .(rleid(Person_ID, Department), Person_ID, Department)][, -1]  # drop the rleid column, not the first row
giving:
Person_ID Department start end
1: 351581 GH 2019-12-01 2019-12-03
2: 351581 FR 2019-12-02 2019-12-02
3: 598168 GH 2019-12-16 2019-12-16
4: 351581 JE 2019-12-08 2019-12-09
5: 615418 AB 2019-12-20 2019-12-20
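2) Base R Convert Date to Date class and then use rle to create a grouping variable g. Then define a Range function that outputs the start and end for a given group, and apply it to each group.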
DF <- transform(DF0, Date = as.Date(Date, "%m/%d/%Y"))
g <- with(rle(paste(DF$Person_ID, DF$Department)), rep(seq_along(lengths), lengths))
Range <- function(x) data.frame(x[1, 1:2], start = min(x$Date), end = max(x$Date))
do.call("rbind", by(DF, g, Range))
giving:
Person_ID Department start end
1 351581 GH 2019-12-01 2019-12-03
2 351581 FR 2019-12-02 2019-12-02
3 598168 GH 2019-12-16 2019-12-16
4 351581 JE 2019-12-08 2019-12-09
5 615418 AB 2019-12-20 2019-12-20
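3) dplyr/data.table Here we use a hybrid approach, taking rleid from data.table and otherwise using dplyr. Convert the dates using lubridate, then group by rleid, Person_ID and Department. The last two are there to ensure that they are included in the output. Compute start and end, then remove the grouping column.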
library(dplyr)
library(data.table)
library(lubridate)
DF0 %>%
mutate(Date = mdy(Date)) %>%
group_by(g = rleid(Person_ID, Department), Person_ID, Department) %>%
summarize(start = min(Date), end = max(Date)) %>%
ungroup %>%
select(-g)
giving:
# A tibble: 5 x 4
Person_ID Department start end
<int> <fct> <date> <date>
1 351581 GH 2019-12-01 2019-12-03
2 351581 FR 2019-12-02 2019-12-02
3 598168 GH 2019-12-16 2019-12-16
4 351581 JE 2019-12-08 2019-12-09
5 615418 AB 2019-12-20 2019-12-20
4) sqldf Define the group Grp in the inner select and then find the minimum and maximum date by Grp.
library(sqldf)
DF <- transform(DF0, Date = as.Date(Date, "%m/%d/%Y"))
sqldf("select Person_ID, Department, min(Date) as start__Date, max(Date) as end__Date
from ( select
rowid r,
Person_ID,
Department,
Date,
Date - dense_rank() over (partition by Person_ID, Department order by rowid) as Grp
from DF
) group by Grp order by r", method = "name__class")
giving:
Person_ID Department start end
1 351581 GH 2019-12-01 2019-12-03
2 351581 FR 2019-12-02 2019-12-02
3 598168 GH 2019-12-16 2019-12-16
4 351581 JE 2019-12-08 2019-12-09
5 615418 AB 2019-12-20 2019-12-20
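The "Date minus rank" idea behind Grp can be sketched in base R. This is only an illustration on a made-up vector of dates for a single group, assumed to be sorted and free of duplicates, so the rank is simply the position: consecutive dates share the same difference, and a gap changes it.

```r
# Hypothetical dates for one group, sorted and without duplicates
dates <- as.Date(c("2019-12-01", "2019-12-02", "2019-12-03",
                   "2019-12-08", "2019-12-09"))

# Day number minus position is constant within a run of consecutive dates
island <- as.integer(dates) - seq_along(dates)

# Collapse each run to its first and last date
runs <- split(dates, island)
ranges <- data.frame(
  start = unname(do.call(c, lapply(runs, min))),
  end   = unname(do.call(c, lapply(runs, max)))
)

print(ranges)
```

Here the two runs come out as 2019-12-01 to 2019-12-03 and 2019-12-08 to 2019-12-09.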
The input is assumed to be:
Lines <- "Person_ID Department Date
351581 GH 12/1/2019
351581 GH 12/2/2019
351581 GH 12/3/2019
351581 FR 12/2/2019
598168 GH 12/16/2019
351581 JE 12/8/2019
351581 JE 12/9/2019
615418 AB 12/20/2019"
DF0 <- read.table(text = Lines, header = TRUE)
Here I check whether the difference from the previous date, diff(Date), is not 1. If so, a new group starts (taking the cumsum of that indicator means g increases by 1 each time it is TRUE).
library(data.table)
setDT(df)
df[, Date := as.Date(Date, format = '%m/%d/%Y')]
df[, .(start = min(Date), end = max(Date)),
by = .(Person_ID, Department, g = cumsum(c(0, diff(Date)) != 1))]
# Person_ID Department g start end
# 1: 351581 GH 1 2019-12-01 2019-12-03
# 2: 351581 FR 2 2019-12-02 2019-12-02
# 3: 598168 GH 3 2019-12-16 2019-12-16
# 4: 351581 JE 4 2019-12-08 2019-12-09
# 5: 615418 AB 5 2019-12-20 2019-12-20
# 6: 615418 AB 6 2019-12-22 2019-12-22
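To see what the grouping expression produces on its own, here is a small base R sketch on a made-up date vector (not the data from this answer). The as.numeric call is added only to make the conversion from a time difference to plain day counts explicit.

```r
# Hypothetical sorted dates with two gaps
dates <- as.Date(c("2019-12-01", "2019-12-02", "2019-12-03",
                   "2019-12-08", "2019-12-09", "2019-12-20"))

# Day difference to the previous date; the first element has no predecessor,
# so it is padded with 0 and therefore always opens a new run
step <- c(0, as.numeric(diff(dates)))

# TRUE wherever a new run starts
new_run <- step != 1

# The running count of run starts is the run id
g <- cumsum(new_run)

print(g)
# 1 1 1 2 2 3
```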
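If your data is not already sorted by date within each (Person_ID, Department) group, you can sort it in the i part of df[i, j, by]. Sorting by Person_ID and Department before Date keeps the rows of each group adjacent, so that diff(Date) compares neighbouring dates of the same group. That is, change the code above to
df[order(Person_ID, Department, Date), .(start = min(Date), end = max(Date)),
by = .(Person_ID, Department, g = cumsum(c(0, diff(Date)) != 1))]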
Note that, for this updated example, this is not the same as grouping by Person_ID and Department alone:
df[, .(start = min(Date), end = max(Date)),
by = .(Person_ID, Department)]
# Person_ID Department start end
# 1: 351581 GH 2019-12-01 2019-12-03
# 2: 351581 FR 2019-12-02 2019-12-02
# 3: 598168 GH 2019-12-16 2019-12-16
# 4: 351581 JE 2019-12-08 2019-12-09
# 5: 615418 AB 2019-12-20 2019-12-22
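For comparison, the same gap-aware grouping can be sketched without any packages. This is a base R illustration on a small made-up data frame, not the code from this answer; the names key, new_run and res are hypothetical. A new run starts whenever the (Person_ID, Department) pair changes or the step from the previous date is not one day.

```r
# Hypothetical unsorted input with one gap (12/20 to 12/22)
df <- data.frame(
  Person_ID  = c(351581, 351581, 615418, 615418, 351581),
  Department = c("GH", "GH", "AB", "AB", "GH"),
  Date       = as.Date(c("2019-12-02", "2019-12-01", "2019-12-20",
                         "2019-12-22", "2019-12-03"))
)

# Keep each group's rows together and in date order
df <- df[order(df$Person_ID, df$Department, df$Date), ]

# A run starts on the first row, on a change of group, or after a gap
key <- paste(df$Person_ID, df$Department)
new_run <- c(TRUE, key[-1] != key[-nrow(df)] | as.numeric(diff(df$Date)) != 1)
df$g <- cumsum(new_run)

# One output row per run
res <- do.call(rbind, lapply(split(df, df$g), function(x) {
  data.frame(Person_ID = x$Person_ID[1], Department = x$Department[1],
             start = min(x$Date), end = max(x$Date))
}))
rownames(res) <- NULL

print(res)
```

With this input, 615418/AB ends up as two one-day rows, whereas grouping by Person_ID and Department alone would merge them into a single 12/20 to 12/22 range.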
Data used:
df <- fread('
Person_ID Department Date
351581 GH 12/1/2019
351581 GH 12/2/2019
351581 GH 12/3/2019
351581 FR 12/2/2019
598168 GH 12/16/2019
351581 JE 12/8/2019
351581 JE 12/9/2019
615418 AB 12/20/2019
615418 AB 12/22/2019
')
Assuming you have already filtered out the data with gaps, this seems to me like a pretty clean solution. Is that what you're looking for?
library(dplyr)
# %Y (four-digit year) is needed here; %y reads only "20" from "2019" and yields 2020
df <- tibble::tribble(~`Person ID`, ~`Department`, ~`Date`,
"351581" , "GH", as.Date("12/1/2019", format = "%m/%d/%Y"),
"351581" , "GH", as.Date("12/2/2019", format = "%m/%d/%Y"),
"351581" , "GH", as.Date("12/3/2019", format = "%m/%d/%Y"),
"351581" , "FR", as.Date("12/2/2019", format = "%m/%d/%Y"),
"598168" , "GH", as.Date("12/16/2019", format = "%m/%d/%Y"),
"351581" , "JE", as.Date("12/8/2019", format = "%m/%d/%Y"),
"351581" , "JE", as.Date("12/9/2019", format = "%m/%d/%Y"),
"615418" , "AB", as.Date("12/20/2019", format = "%m/%d/%Y"))
df %>%
group_by(`Person ID`, Department) %>%
summarise(`Start Date` = min(Date),
`End Date` = max(Date)) %>%
ungroup()
#> # A tibble: 5 x 4
#> `Person ID` Department `Start Date` `End Date`
#> <chr> <chr> <date> <date>
#> 1 351581 FR 2019-12-02 2019-12-02
#> 2 351581 GH 2019-12-01 2019-12-03
#> 3 351581 JE 2019-12-08 2019-12-09
#> 4 598168 GH 2019-12-16 2019-12-16
#> 5 615418 AB 2019-12-20 2019-12-20
Assuming you have the data in a data.frame, you can get the result by grouping by Person ID and Department:
library(dplyr)
data %>%
group_by(`Person ID`, Department) %>%
summarise(`Start Date` = min(as.Date(Date, format = "%m/%d/%Y")),
`End Date` = max(as.Date(Date, format = "%m/%d/%Y")))
The output will be:
# A tibble: 5 x 4
# Groups: Person_id [3]
Person ID Department `Start Date` `End Date`
<int> <fct> <date> <date>
1 351581 FR 2019-12-02 2019-12-02
2 351581 GH 2019-12-01 2019-12-03
3 351581 JE 2019-12-08 2019-12-09
4 598168 GH 2019-12-16 2019-12-16
5 615418 AB 2019-12-20 2019-12-20
Hope this helps.
Here is a base R solution
dfout <- do.call(rbind,
c(lapply(split(df,cut(1:nrow(df),c(0,cumsum(rle(df$Department)$lengths)))),
function(x) data.frame(unique(x[-3]),
`Start Date` = head(x[,3],1),
`End Date` = tail(x[,3],1))),
make.row.names = F)
)
so that
> dfout
Person.ID Department Start.Date End.Date
1 351581 GH 12/1/2019 12/3/2019
2 351581 FR 12/2/2019 12/2/2019
3 598168 GH 12/16/2019 12/16/2019
4 351581 JE 12/8/2019 12/9/2019
5 615418 AB 12/20/2019 12/20/2019