
R: Converting consecutive dates from a single column into a 2-column range

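I'm trying to figure out how to combine rows that have a single date column so that the new table/data frame/tibble has two columns: one for the start date and one for the end date, but only for consecutive dates (i.e. any gap in the dates should start a new row in the new table). The result should also be grouped by the other categorical columns.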

An example of the kind of data I'm working with:

   Person ID   Department   Date     
   351581      JE           12/1/2019
   351581      JE           12/2/2019
   351581      FR           12/2/2019
   351581      JE           12/3/2019
   598168      GH           12/16/2019
   351581      JE           12/8/2019
   351581      JE           12/9/2019
   615418      AB           12/20/2019
   615418      AB           12/22/2019

The desired result is:

   Person ID   Department   Start Date      End Date
   351581      JE           12/1/2019       12/3/2019
   351581      FR           12/2/2019       12/2/2019
   598168      GH           12/16/2019      12/16/2019
   351581      JE           12/8/2019       12/9/2019
   615418      AB           12/20/2019      12/20/2019
   615418      AB           12/22/2019      12/22/2019

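So far my searches have turned up a couple of possibly related questions about combining date ranges, but I'm not sure how to apply them to a single date column:

Find all date ranges for overlapping start and end dates in R

Date roll-up in R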

dplyr

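Adding this for the benefit of future readers: I ended up applying the accepted solution with dplyr, simply because I'm more familiar with the syntax.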

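library(dplyr)

df %>%
  # Dates are month/day/year strings, so the format has to be given explicitly
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  arrange(`Person ID`, Department, Date) %>%
  # g increases by 1 whenever a date is not exactly one day after the previous row
  group_by(`Person ID`, Department,
           g = cumsum(c(0, diff(Date)) != 1)
           ) %>%
  summarize(Start = min(Date), End = max(Date)) %>%
  ungroup() %>%
  select(-g)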

We assume here that what is being asked for is the minimum and maximum date within each consecutive run of Person_ID and Department.

1) data.table First convert the Date column to Date class, then group by rleid(Person_ID, Department) and take the minimum and maximum.

library(data.table)
library(lubridate)

DT <- as.data.table(DF0)
DT[, Date := mdy(Date)][
   , list(start = min(Date), end = max(Date)), 
   by = .(rleid(Person_ID, Department), Person_ID, Department)][, -1]  # drop the rleid column

giving:

   Person_ID Department      start        end
1:    351581         GH 2019-12-01 2019-12-03
2:    351581         FR 2019-12-02 2019-12-02
3:    598168         GH 2019-12-16 2019-12-16
4:    351581         JE 2019-12-08 2019-12-09
5:    615418         AB 2019-12-20 2019-12-20

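2) Base R Convert Date to Date class, then use rle to create a grouping variable g. Then define a Range function that outputs the start and end for a given group, and apply it to each group.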

DF <- transform(DF0, Date = as.Date(Date, "%m/%d/%Y"))
g <- with(rle(paste(DF$Person_ID, DF$Department)), rep(seq_along(lengths), lengths))
Range <- function(x) data.frame(x[1, 1:2], start = min(x$Date), end = max(x$Date))
do.call("rbind", by(DF, g, Range))

giving:

  Person_ID Department      start        end
1    351581         GH 2019-12-01 2019-12-03
2    351581         FR 2019-12-02 2019-12-02
3    598168         GH 2019-12-16 2019-12-16
4    351581         JE 2019-12-08 2019-12-09
5    615418         AB 2019-12-20 2019-12-20

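3) dplyr/data.table Here we use a hybrid approach, taking rleid from data.table and otherwise using dplyr. Convert the dates using lubridate and group by rleid as well as Person_ID and Department. The last two are there to ensure that they are included in the output. Compute the start and end, then remove the grouping column.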

library(dplyr)
library(data.table)
library(lubridate)

DF0 %>%
  mutate(Date = mdy(Date)) %>%
  group_by(g = rleid(Person_ID, Department), Person_ID, Department) %>%
  summarize(start = min(Date), end = max(Date)) %>%
  ungroup %>%
  select(-g)

giving:

# A tibble: 5 x 4
  Person_ID Department start      end       
      <int> <fct>      <date>     <date>    
1    351581 GH         2019-12-01 2019-12-03
2    351581 FR         2019-12-02 2019-12-02
3    598168 GH         2019-12-16 2019-12-16
4    351581 JE         2019-12-08 2019-12-09
5    615418 AB         2019-12-20 2019-12-20

4) sqldf In the inner select define a group Grp, then find the minimum and maximum dates by Grp.

library(sqldf)

DF <- transform(DF0, Date = as.Date(Date, "%m/%d/%Y"))

sqldf("select Person_ID, Department, min(Date) as start__Date, max(Date) as end__Date
from ( select 
    rowid r, 
    Person_ID, 
    Department, 
    Date, 
    Date - dense_rank() over (partition by Person_ID, Department order by rowid) as Grp
  from DF
) group by Grp order by r", method = "name__class")

giving:

  Person_ID Department      start        end
1    351581         GH 2019-12-01 2019-12-03
2    351581         FR 2019-12-02 2019-12-02
3    598168         GH 2019-12-16 2019-12-16
4    351581         JE 2019-12-08 2019-12-09
5    615418         AB 2019-12-20 2019-12-20
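
The idea behind Grp can be seen without a database. The following is only a rough base R sketch, assuming a single person and department and a made-up vector of dates (not the DF used above): subtracting each date's rank from its day number gives a value that stays constant across a run of consecutive days and changes at every gap.

```r
# Hypothetical dates for one person/department, sorted and without duplicates.
dates <- as.Date(c("2019-12-01", "2019-12-02", "2019-12-03",
                   "2019-12-08", "2019-12-09"))

# Position of each date within its partition; stands in for dense_rank().
rnk <- seq_along(dates)

# Day number minus rank: identical for all dates in a consecutive run.
grp <- as.integer(dates) - rnk
print(grp)

# First and last date of each run.
islands <- lapply(split(dates, grp), range)
print(islands)
```

Here seq_along() can replace dense_rank() only because the illustrative dates are already in order and unique; with real data the rank would have to be computed per Person_ID and Department.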

Note

The input is assumed to be:

Lines <- "Person_ID   Department   Date     
   351581      GH           12/1/2019
   351581      GH           12/2/2019
   351581      GH           12/3/2019
   351581      FR           12/2/2019
   598168      GH           12/16/2019
   351581      JE           12/8/2019
   351581      JE           12/9/2019
   615418      AB           12/20/2019"

DF0 <- read.table(text = Lines, header = TRUE)

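Here I'm checking whether the difference from the previous date (diff(Date)) is not 1. If so, a new group starts (taking the cumsum of this indicator means that g increases by 1 whenever it is TRUE).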

library(data.table)
setDT(df)

df[, Date := as.Date(Date, format = '%m/%d/%Y')]


df[, .(start = min(Date), end = max(Date)),
   by = .(Person_ID, Department, g = cumsum(c(0, diff(Date)) != 1))]

#    Person_ID Department g      start        end
# 1:    351581         GH 1 2019-12-01 2019-12-03
# 2:    351581         FR 2 2019-12-02 2019-12-02
# 3:    598168         GH 3 2019-12-16 2019-12-16
# 4:    351581         JE 4 2019-12-08 2019-12-09
# 5:    615418         AB 5 2019-12-20 2019-12-20
# 6:    615418         AB 6 2019-12-22 2019-12-22
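
To see the indicator in isolation, here is a minimal base R sketch on a made-up vector of dates for a single person and department (an assumption for illustration, not the df above). It shows the intermediate values of the gap, the resulting run ids, and one way to read off the start and end of each run.

```r
# Hypothetical dates for one person/department, already sorted.
dates <- as.Date(c("2019-12-01", "2019-12-02", "2019-12-03",
                   "2019-12-08", "2019-12-09"))

# Gap in days from the previous date; the first element has no predecessor,
# so it is padded with 0 and therefore always starts a new run.
gap <- c(0, diff(dates))

# TRUE wherever a new run starts; cumsum() turns those flags into run ids.
g <- cumsum(gap != 1)
print(g)

# Because the dates are sorted, the first and last element of each run
# are its start and end.
runs <- data.frame(
  start = dates[!duplicated(g)],
  end   = dates[!duplicated(g, fromLast = TRUE)]
)
print(runs)
```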

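If your data isn't already sorted by date within the (Person_ID, Department) groups, you can sort it in the i part of df[i, j, by]. Sort by the grouping columns as well as Date: g is computed from diff(Date) over all rows at once, so each group's dates need to sit next to each other. That is, change the code above to

df[order(Person_ID, Department, Date), .(start = min(Date), end = max(Date)),
   by = .(Person_ID, Department, g = cumsum(c(0, diff(Date)) != 1))]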

Note that for this updated example, this is not the same as grouping by Person_ID and Department alone:

df[, .(start = min(Date), end = max(Date)),
   by = .(Person_ID, Department)]

#    Person_ID Department      start        end
# 1:    351581         GH 2019-12-01 2019-12-03
# 2:    351581         FR 2019-12-02 2019-12-02
# 3:    598168         GH 2019-12-16 2019-12-16
# 4:    351581         JE 2019-12-08 2019-12-09
# 5:    615418         AB 2019-12-20 2019-12-22

Data used:

df <- fread('
   Person_ID   Department   Date     
   351581      GH           12/1/2019
   351581      GH           12/2/2019
   351581      GH           12/3/2019
   351581      FR           12/2/2019
   598168      GH           12/16/2019
   351581      JE           12/8/2019
   351581      JE           12/9/2019
   615418      AB           12/20/2019
   615418      AB           12/22/2019
')

Assuming you have already filtered out the data with gaps, this seems like a pretty clean solution to me. Is that what you are looking for?


library(dplyr)

# Note the upper-case %Y: the years are four digits, and %y would read "20" as 2020
df <- tibble::tribble(~`Person ID`, ~`Department`,    ~`Date`,
                      "351581"    ,          "GH", as.Date("12/1/2019", format = "%m/%d/%Y"),
                      "351581"    ,          "GH", as.Date("12/2/2019", format = "%m/%d/%Y"),
                      "351581"    ,          "GH", as.Date("12/3/2019", format = "%m/%d/%Y"),
                      "351581"    ,          "FR", as.Date("12/2/2019", format = "%m/%d/%Y"),
                      "598168"    ,          "GH", as.Date("12/16/2019", format = "%m/%d/%Y"),
                      "351581"    ,          "JE", as.Date("12/8/2019", format = "%m/%d/%Y"),
                      "351581"    ,          "JE", as.Date("12/9/2019", format = "%m/%d/%Y"),
                      "615418"    ,          "AB", as.Date("12/20/2019", format = "%m/%d/%Y"))

df %>%
  group_by(`Person ID`, Department) %>%
  summarise(`Start Date` = min(Date),
            `End Date` = max(Date)) %>% 
  ungroup()

#> # A tibble: 5 x 4
#>   `Person ID` Department `Start Date` `End Date`
#>   <chr>       <chr>      <date>       <date>    
#> 1 351581      FR         2019-12-02   2019-12-02
#> 2 351581      GH         2019-12-01   2019-12-03
#> 3 351581      JE         2019-12-08   2019-12-09
#> 4 598168      GH         2019-12-16   2019-12-16
#> 5 615418      AB         2019-12-20   2019-12-20

Using dplyr

Assuming you have the data in a data.frame, you can get the result by grouping by Person ID and Department:

library(dplyr)
data %>%
  group_by(`Person ID`, Department) %>%
  summarise(`Start Date` = min(as.Date(Date, format = "%m/%d/%Y")), 
            `End Date` = max(as.Date(Date, format = "%m/%d/%Y")))

The output will be:

# A tibble: 5 x 4
# Groups:   Person ID [3]
  Person ID Department `Start Date` `End Date`
      <int> <fct>      <date>       <date>    
1    351581 FR         2019-12-02   2019-12-02
2    351581 GH         2019-12-01   2019-12-03
3    351581 JE         2019-12-08   2019-12-09
4    598168 GH         2019-12-16   2019-12-16
5    615418 AB         2019-12-20   2019-12-20

Hope this helps.

Here is a base R solution

dfout <- do.call(rbind,
                 c(lapply(split(df,cut(1:nrow(df),c(0,cumsum(rle(df$Department)$lengths)))), 
                          function(x) data.frame(unique(x[-3]),
                                                 `Start Date` = head(x[,3],1),
                                                 `End Date` = tail(x[,3],1))),
                   make.row.names = FALSE)
                 )

such that

> dfout
  Person.ID Department Start.Date   End.Date
1    351581         GH  12/1/2019  12/3/2019
2    351581         FR  12/2/2019  12/2/2019
3    598168         GH 12/16/2019 12/16/2019
4    351581         JE  12/8/2019  12/9/2019
5    615418         AB 12/20/2019 12/20/2019
