How to compare two rows of dates in R and transform into one row
I have a data set which contains multiple rows of date information (intervals) for the same names; these rows should be compared and eventually transformed into one row. I'd like to achieve the following:
Data:
names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)
Desired result:
names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "16-4-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)
Transform the dates:
df1$date1 <- as.Date(df1$date1, "%d-%m-%Y")
df1$date2 <- as.Date(df1$date2, "%d-%m-%Y")
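The conversion matters because the raw values are day-first character strings, which compare alphabetically rather than chronologically. A small base-R sketch of the difference (the variable names str_cmp, d and date_cmp are only for illustration):

```r
# Character strings compare alphabetically, Date objects chronologically.
str_cmp <- "16-4-2020" < "2-3-2020"   # TRUE, although 16 April is later than 2 March
d <- as.Date(c("16-4-2020", "2-3-2020"), format = "%d-%m-%Y")
date_cmp <- d[1] < d[2]               # FALSE, as expected for real dates
```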
Here's one way (probably not the most concise) using dplyr. First we convert the dates to Date format and sort them, so we know that within each name the second row comes later. Then, for each name, we work out whether both rows need to be kept, keep_both: this is the case when a row starts more than 60 days after the previous row ends. If not keep_both, get the min and max dates and keep only the first row. Note that I'm assuming the intervals are in the right order, i.e. date2 is later than date1 in each row here. Output matches your desired output except for a typo on Rick.
names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names, date1, date2)
library(tidyverse)
df1 %>%
  mutate(across(c(date1, date2), lubridate::dmy)) %>%
  arrange(names, date1, date2) %>%
  group_by(names) %>%
  mutate(
    keep_both = any((date1 - lag(date2)) > 60, na.rm = TRUE),
    new_date1 = if_else(keep_both, date1, min(date1)),
    new_date2 = if_else(keep_both, date2, max(date2)),
  ) %>%
  filter(keep_both | row_number() == 1) %>%
  select(names, date1 = new_date1, date2 = new_date2)
#> # A tibble: 5 x 3
#> # Groups:   names [4]
#>   names date1      date2
#>   <chr> <date>     <date>
#> 1 Harry 2018-08-27 2019-06-27
#> 2 Harry 2020-02-04 2020-04-08
#> 3 John  2016-03-01 2020-04-16
#> 4 Katie 2019-01-05 2020-04-10
#> 5 Rick  2018-01-13 2020-03-02
Created on 2020-07-13 by the reprex package (v0.3.0)
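The keep_both test in the dplyr code above can be checked by hand for a single name. The sketch below uses base R only and takes the dates from the sample data; the variable names harry_gap and john_gap are made up for this illustration.

```r
# Gap between the end of the first interval and the start of the second.
# Dates come from the sample data; variable names are illustrative only.
harry_gap <- as.Date("4-2-2020", "%d-%m-%Y") - as.Date("27-6-2019", "%d-%m-%Y")
john_gap  <- as.Date("18-5-2016", "%d-%m-%Y") - as.Date("16-4-2020", "%d-%m-%Y")

harry_gap > 60  # TRUE: the rows are far apart, so both are kept
john_gap > 60   # FALSE: the second interval starts before the first ends, so the rows are merged
```

A negative gap, as for John, simply means the intervals overlap.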
I used slightly altered sample data to make sure that intervals that are <= 60 days apart from each other get joined, as described in the question.
sample data
names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "28-4-2020", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "28-5-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)
  names     date1     date2
1  John  1-3-2016 16-4-2020
2  John 28-4-2020 28-5-2020   # !! <-- altered so interval-gap with line 1 <= 60 days
3  Rick 13-1-2018  2-3-2020
4  Rick  4-2-2020 16-2-2020
5 Katie  5-1-2019 25-2-2020
6 Katie 29-1-2020 10-4-2020
7 Harry 27-8-2018 27-6-2019
8 Harry  4-2-2020  8-4-2020
names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("28-5-2020", "2-3-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)
  names     date1     date2
1  John  1-3-2016 28-5-2020   # !! <-- joined, since gap <= 60 days
2  Rick 13-1-2018  2-3-2020   # !! <-- fixed typo in the sample data you provided
3 Katie  5-1-2019 10-4-2020
4 Harry 27-8-2018 27-6-2019
5 Harry  4-2-2020  8-4-2020
code
The code makes use of the data.table and intervals packages. Since intervals only works on numeric (integer or real) intervals, the date columns get converted to numeric before interval creation/extension/merging, and back to date format after processing.
What the code below does:
- lapply() over chunks of the data, split by name (using data.table's split() method with the by argument). keep.by = FALSE is used, since we do not need the column, and the name is also stored in the names of the created list.
- For each chunk, create integer intervals (type = "Z") and expand each one by half the value of the gap variable (set to 60 in the code below). So each interval gets extended by 30 days before and after.
- reduce() the expanded intervals, so that intervals which now overlap are merged, then contract() the result by the same 30 days and close the intervals.
- Use data.table::rbindlist() to merge all the results back together.
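To see why expanding by half the gap merges exactly those intervals that are at most gap days apart, here is a minimal base-R sketch of the same expand / reduce / contract idea without the intervals package. merge_with_gap is a hypothetical helper written for this illustration, not a function from any package, and it assumes numeric start and end vectors with start <= end.

```r
# Hypothetical helper: merge intervals whose gap is <= `gap`
# by expanding, reducing (merging overlaps) and contracting again.
merge_with_gap <- function(start, end, gap = 60) {
  # expand every interval by half the gap on both sides
  s <- start - gap / 2
  e <- end + gap / 2
  # reduce: sort by start, then merge intervals that now overlap or touch
  o <- order(s)
  s <- s[o]
  e <- e[o]
  # a new group starts when an interval begins after everything before it has ended
  running_end <- cummax(e)
  new_group <- c(TRUE, s[-1] > running_end[-length(e)])
  grp <- cumsum(new_group)
  merged_start <- as.numeric(tapply(s, grp, min))
  merged_end <- as.numeric(tapply(e, grp, max))
  # contract: undo the expansion
  data.frame(start = merged_start + gap / 2, end = merged_end - gap / 2)
}

# John's altered rows are 12 days apart, so they collapse into one interval.
john_start <- as.numeric(as.Date(c("2016-03-01", "2020-04-28")))
john_end   <- as.numeric(as.Date(c("2020-04-16", "2020-05-28")))
john_merged <- merge_with_gap(john_start, john_end)

# Harry's rows are more than 60 days apart, so both intervals survive.
harry_start <- as.numeric(as.Date(c("2018-08-27", "2020-02-04")))
harry_end   <- as.numeric(as.Date(c("2019-06-27", "2020-04-08")))
harry_merged <- merge_with_gap(harry_start, harry_end)

# convert the numeric result back to dates for inspection
as.Date(unlist(john_merged), origin = "1970-01-01")
```

Two intervals end up in the same group only if the second starts no more than gap days after the first ends, which is the same rule the package-based code below applies per name.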
library( data.table )
library( intervals )
#set maximum gap between intervals
gap <- 60
#set data to data.table format
setDT(df1)
#parse the d-m-Y strings as dates, then set them to numeric (required by the intervals-package)
df1[, c("date1", "date2") := lapply( .SD, function(d) as.numeric( as.Date( d, format = "%d-%m-%Y" ) ) ), .SDcols = c("date1", "date2") ]
#where the magic happens (see text above for explanation )
ans <- data.table::rbindlist(
  lapply( split( df1 , by = "names", keep.by = FALSE ), function(x) {
    as.data.table(
      intervals::close_intervals( intervals::contract( intervals::reduce( intervals::expand(
        intervals::Intervals( x, type = "Z" ),
        gap/2 ) ), gap/2 )
      )
    )
  }),
  use.names = TRUE, idcol = "name" )
#use names from df1
setnames( ans, names(ans), names(df1) )
#set numeric back to date
ans[, c("date1", "date2") := lapply( .SD, as.Date, origin = "1970-01-01" ), .SDcols = c("date1", "date2") ]
output
   names      date1      date2
1:  John 2016-03-01 2020-05-28
2:  Rick 2018-01-13 2020-03-02
3: Katie 2019-01-05 2020-04-10
4: Harry 2018-08-27 2019-06-27
5: Harry 2020-02-04 2020-04-08