How to compare two rows of dates in R and transform into one row
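I have a dataset that contains multiple rows of date information (intervals) for the same name. These rows should be compared and eventually collapsed into one row. I would like to achieve the following:
Data: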
names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)
Desired result:
names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "16-4-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)
Converting the dates:
df1$date1 <- as.Date(df1$date1, "%d-%m-%Y")
df1$date2 <- as.Date(df1$date2, "%d-%m-%Y")
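Here is one approach using dplyr (perhaps not the most concise). First we convert the dates to Date format, then for each name we work out whether both rows need to be kept, stored as keep_both. We sort by date, so we know the second row comes later. If keep_both is not TRUE, we take the minimum and maximum dates. Note that I assume the intervals are the right way round, i.e. date2 is later than date1 in every row here. The output matches your desired output, except for the typo for Rick.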
names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names, date1, date2)
library(tidyverse)
df1 %>%
  mutate(across(c(date1, date2), lubridate::dmy)) %>%
  arrange(names, date1, date2) %>%
  group_by(names) %>%
  mutate(
    keep_both = any((date1 - lag(date2)) > 60, na.rm = TRUE),
    new_date1 = if_else(keep_both, date1, min(date1)),
    new_date2 = if_else(keep_both, date2, max(date2)),
  ) %>%
  filter(keep_both | row_number() == 1) %>%
  select(names, date1 = new_date1, date2 = new_date2)
#> # A tibble: 5 x 3
#> # Groups:   names [4]
#>   names date1      date2
#>   <chr> <date>     <date>
#> 1 Harry 2018-08-27 2019-06-27
#> 2 Harry 2020-02-04 2020-04-08
#> 3 John  2016-03-01 2020-04-16
#> 4 Katie 2019-01-05 2020-04-10
#> 5 Rick  2018-01-13 2020-03-02
Created on 2020-07-13 by the reprex package (v0.3.0)
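To make the keep_both decision easier to follow, here is a small base-R sketch of the same gap arithmetic (the start of the later row minus the end of the earlier row, per name). It is only an illustration under the same assumption of two correctly ordered intervals per name; the variable names example, prev_end and gap_days are made up for this sketch.

```r
# Same sample data as above, parsed to Date
example <- data.frame(
  names = c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry"),
  date1 = as.Date(c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020",
                    "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020"), "%d-%m-%Y"),
  date2 = as.Date(c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020",
                    "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020"), "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# Sort so that, within a name, the later interval comes second
example <- example[order(example$names, example$date1, example$date2), ]

# End date of the previous row within the same name (NA for the first row);
# this plays the role of lag(date2) in the dplyr code
prev_end <- ave(as.numeric(example$date2), example$names,
                FUN = function(v) c(NA, v[-length(v)]))

# Gap in days between this row's start and the previous row's end
example$gap_days <- as.numeric(example$date1) - prev_end

# A name keeps both rows only if some gap exceeds 60 days
keep_both <- tapply(example$gap_days > 60, example$names, any, na.rm = TRUE)
keep_both
```

Only Harry has a positive gap (222 days between 27-6-2019 and 4-2-2020); for the other names the second interval starts before the first one ends, so the gap is negative and the two rows are collapsed.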
I used slightly altered sample data, to make sure that intervals that are <= 60 days apart are joined, as described in the question.
Sample data
names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "28-4-2020", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "28-5-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)
  names     date1     date2
1  John  1-3-2016 16-4-2020
2  John 28-4-2020 28-5-2020  # !! <-- altered so interval-gap with line 1 <= 60 days
3  Rick 13-1-2018  2-3-2020
4  Rick  4-2-2020 16-2-2020
5 Katie  5-1-2019 25-2-2020
6 Katie 29-1-2020 10-4-2020
7 Harry 27-8-2018 27-6-2019
8 Harry  4-2-2020  8-4-2020
names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("28-5-2020", "2-3-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)
  names     date1     date2
1  John  1-3-2016 28-5-2020  # !! <-- joined, since gap <= 60 days
2  Rick 13-1-2018  2-3-2020  # !! <-- fixed typo in the sample data you provided
3 Katie  5-1-2019 10-4-2020
4 Harry 27-8-2018 27-6-2019
5 Harry  4-2-2020  8-4-2020
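Code
The code uses the data.table and intervals packages. Since intervals only works with numeric (integer or real) intervals, the date columns are converted to numeric before the intervals are created, expanded and merged, and are converted back to dates after processing.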
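As a quick illustration of that round trip (a minimal sketch, separate from the answer's own code): a Date is stored as the number of days since 1970-01-01, so converting to numeric and back loses nothing, and a gap in days is a plain subtraction. The variable names below are only for this example.

```r
# Parse a d-m-Y string, convert to numeric, and convert back
d    <- as.Date("16-4-2020", "%d-%m-%Y")
n    <- as.numeric(d)                         # days since 1970-01-01
back <- as.Date(n, origin = "1970-01-01")     # numeric -> Date again
d == back                                     # TRUE: the round trip is lossless

# On the numeric scale, the gap between two dates is a simple difference
gap_days <- as.numeric(as.Date("28-4-2020", "%d-%m-%Y")) - n
gap_days                                      # 12 days between 16-4-2020 and 28-4-2020
```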
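What the code below does:
- lapply() runs over the split chunks (created with data.table's split() method and its by argument), one chunk per name. keep.by = FALSE is used because we do not need that column in the chunks; the name is also stored in the names of the resulting list.
- Within each chunk, the intervals are expanded by half the value of the gap variable (set to 60 in the code below), so every interval is extended by 30 days at the front and at the back. Overlapping intervals are then merged with reduce(), and the result is contracted again by the same 30 days.
- data.table::rbindlist() binds all results back together.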
library( data.table )
library( intervals )
#set maximum gap between intervals
gap = 60
#set data to data.table format
setDT(df1)
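#set dates to numeric (required by the intervals-package);
#the d-m-Y strings of the sample data are parsed to Date first
df1[, c("date1", "date2") := lapply( .SD, function(d) as.numeric( as.Date( d, "%d-%m-%Y" ) ) ), .SDcols = c("date1", "date2") ]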
#where the magic happens (see text above for explanation )
ans <- data.table::rbindlist(
  lapply( split( df1 , by = "names", keep.by = FALSE ), function(x) {
    as.data.table(
      intervals::close_intervals( intervals::contract( intervals::reduce( intervals::expand(
        intervals::Intervals( x, type = "Z" ),
        gap/2 ) ), gap/2 )
      )
    )
  }),
  use.names = TRUE, idcol = "name" )
#use names from df1
setnames( ans, names(ans), names(df1) )
#set numeric back to date
ans[, c("date1", "date2") := lapply( .SD, as.Date, origin = "1970-01-01" ), .SDcols = c("date1", "date2") ]
Output
   names      date1      date2
1:  John 2016-03-01 2020-05-28
2:  Rick 2018-01-13 2020-03-02
3: Katie 2019-01-05 2020-04-10
4: Harry 2018-08-27 2019-06-27
5: Harry 2020-02-04 2020-04-08