How to compare two rows of dates in R and transform into one row

I have a data set that contains multiple rows of date information (intervals) for the same names. These rows should be compared and, where possible, collapsed into one row. I'd like to achieve the following:

  • If the intervals overlap, keep one row with the earliest and the latest of the four dates
  • If the intervals do not overlap, but the time between them is less than or equal to 60 days, do the same as above: keep one row with the earliest and the latest of the four dates
  • If the intervals do not overlap, and the time between them is more than 60 days, do nothing (keep both rows)
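The three rules reduce to one test on the gap between two intervals: a zero or negative gap means they overlap or touch, and a gap of up to 60 days still counts as a merge. A minimal base R sketch of that test follows; `should_merge`, `max_gap` and `d` are hypothetical names used only for illustration, and each interval is assumed to start on or before the day it ends.

```r
# Hypothetical helper: should two intervals belonging to one name be merged?
# Assumes start <= end within each interval; max_gap is measured in days.
should_merge <- function(start1, end1, start2, end2, max_gap = 60) {
  # days from the end of the earlier interval to the start of the later one;
  # zero or negative means the intervals overlap or touch
  gap <- as.numeric(max(start1, start2) - min(end1, end2))
  gap <= max_gap
}

# shorthand for parsing the day-month-year strings used in the data
d <- function(x) as.Date(x, "%d-%m-%Y")

# John's rows overlap, Harry's rows are more than 60 days apart
john  <- should_merge(d("1-3-2016"), d("16-4-2020"), d("18-5-2016"), d("13-2-2020"))
harry <- should_merge(d("27-8-2018"), d("27-6-2019"), d("4-2-2020"), d("8-4-2020"))
```

With the data from the question, `john` should be TRUE and `harry` FALSE.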

Data:

names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)

Desired result:

names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "16-4-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)

Transform the dates:

df1$date1 <- as.Date(df1$date1, "%d-%m-%Y")
df1$date2 <- as.Date(df1$date2, "%d-%m-%Y")
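After the conversion it can be worth checking that every string was parsed and that each interval is ordered, since the merging logic depends on both. The following is only a sketch on two rows of the data; `parsed_ok`, `ordered_ok` and `interval_days` are illustrative names.

```r
# two rows of the question's data, kept as character strings
df1 <- data.frame(
  names = c("John", "John"),
  date1 = c("1-3-2016", "18-5-2016"),
  date2 = c("16-4-2020", "13-2-2020"),
  stringsAsFactors = FALSE
)
df1$date1 <- as.Date(df1$date1, "%d-%m-%Y")
df1$date2 <- as.Date(df1$date2, "%d-%m-%Y")

# a string that does not match the format becomes NA rather than an error
parsed_ok <- !anyNA(df1$date1) && !anyNA(df1$date2)
# every interval should start on or before the day it ends
ordered_ok <- all(df1$date1 <= df1$date2)
# subtracting two Date vectors gives the interval lengths in days
interval_days <- as.numeric(df1$date2 - df1$date1)
```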

Here's one way (probably not the most concise) using dplyr. First we convert the dates to Date format, then for each name:

  1. figure out if the second interval starts more than 60 days after the first one ends. If so, we tag both rows as keep_both. We sorted the dates, so we know the second row comes later.
  2. for rows that aren't marked keep_both, get the min and max dates. Note that I'm assuming the intervals are in the right order, i.e. date2 is later than date1 in each row here.
  3. filter the data to keep just the first row for each name, unless we are keeping both.

The output matches your desired output except for Rick, where the desired output appears to contain a typo.

names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names, date1, date2)

library(tidyverse)

df1 %>%
  mutate(across(c(date1, date2), lubridate::dmy)) %>%
  arrange(names, date1, date2) %>%
  group_by(names) %>%
  mutate(
    keep_both = any((date1 - lag(date2)) > 60, na.rm = TRUE),
    new_date1 = if_else(keep_both, date1, min(date1)),
    new_date2 = if_else(keep_both, date2, max(date2)),
  ) %>%
  filter(keep_both | row_number() == 1) %>%
  select(names, date1 = new_date1, date2 = new_date2)
#> # A tibble: 5 x 3
#> # Groups:   names [4]
#>   names date1      date2     
#>   <chr> <date>     <date>    
#> 1 Harry 2018-08-27 2019-06-27
#> 2 Harry 2020-02-04 2020-04-08
#> 3 John  2016-03-01 2020-04-16
#> 4 Katie 2019-01-05 2020-04-10
#> 5 Rick  2018-01-13 2020-03-02

Created on 2020-07-13 by the reprex package (v0.3.0)
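The keep_both flag above is computed once per name, which suits the two rows per name in the question. If a name could have more than two rows, one possible generalisation is to sort the intervals and start a new block whenever the gap to everything seen so far exceeds 60 days. This is a base R sketch under that assumption, not part of the answer above; `merge_intervals` and `max_gap` are hypothetical names.

```r
# Hypothetical helper: merge any number of intervals for one name.
# start and end are Date vectors with start <= end in each position.
merge_intervals <- function(start, end, max_gap = 60) {
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  # latest end date seen so far, as a number of days
  running_end <- cummax(as.numeric(end))
  # a new block begins when an interval starts more than max_gap days
  # after the latest end among all earlier intervals
  new_block <- c(TRUE,
                 as.numeric(start)[-1] - running_end[-length(running_end)] > max_gap)
  block <- cumsum(new_block)
  data.frame(
    date1 = start[new_block],  # first start of each block (data is sorted)
    date2 = as.Date(as.vector(tapply(as.numeric(end), block, max)),
                    origin = "1970-01-01")
  )
}

df1 <- data.frame(
  names = c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry"),
  date1 = as.Date(c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020",
                    "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020"), "%d-%m-%Y"),
  date2 = as.Date(c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020",
                    "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020"), "%d-%m-%Y"),
  stringsAsFactors = FALSE
)

# apply the helper to each name and stack the pieces
pieces <- lapply(split(df1, df1$names), function(g) {
  data.frame(names = g$names[1], merge_intervals(g$date1, g$date2),
             stringsAsFactors = FALSE)
})
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

On the question's data this should give the same five rows as the dplyr pipeline, with Harry kept as two rows.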

I used slightly altered sample data, to make sure that intervals that are <= 60 days apart from each other get joined as described in the question.

sample data

names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "28-4-2020", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "28-5-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)

  names     date1     date2
1  John  1-3-2016 16-4-2020
2  John 28-4-2020 28-5-2020 # !! <-- altered so interval-gap with line 1 <= 60 days
3  Rick 13-1-2018  2-3-2020
4  Rick  4-2-2020 16-2-2020
5 Katie  5-1-2019 25-2-2020
6 Katie 29-1-2020 10-4-2020
7 Harry 27-8-2018 27-6-2019
8 Harry  4-2-2020  8-4-2020

names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("28-5-2020", "2-3-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)

  names     date1     date2
1  John  1-3-2016 28-5-2020  # !! <-- joined, since gap <= 60 days
2  Rick 13-1-2018  2-3-2020  # !! <-- fixed typo in the sample data you provided
3 Katie  5-1-2019 10-4-2020
4 Harry 27-8-2018 27-6-2019
5 Harry  4-2-2020  8-4-2020

code

The code makes use of the data.table and intervals packages. Since intervals only works on numeric (integer or real) intervals, the date columns get converted to numeric before the intervals are created, extended and merged, and back to Date format after processing.

What the code below does:

  1. Loop (lapply()) over chunks of the data, split by name (using the split() method for data.table with the by argument). keep.by = FALSE is used, since we do not need the names column inside each chunk; the name is also stored in the names of the created list.
  2. For each chunk (= name), define intervals based on the two date columns, and extend these intervals by half the value of the gap variable (set to 60 in the code below). So each interval gets extended by 30 days before and after.
    It then joins overlapping/touching (extended) intervals, and finally removes the extensions.
  3. Use data.table::rbindlist() to merge all the results back together.
  4. Set the column names, and convert the numeric values back to Date format.
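The expand, merge and contract idea in step 2 can be illustrated without the intervals package. The base R sketch below applies it to John's two rows from the altered sample data; it is only an illustration of the idea with made-up variable names (`iv`, `ex`, `merged`, `out`), not the package's implementation.

```r
gap <- 60

# John's two intervals from the altered sample data, as day numbers
iv <- data.frame(
  start = as.numeric(as.Date(c("2016-03-01", "2020-04-28"))),
  end   = as.numeric(as.Date(c("2020-04-16", "2020-05-28")))
)

# 1. expand every interval by gap / 2 days on both sides
ex <- data.frame(start = iv$start - gap / 2, end = iv$end + gap / 2)

# 2. merge expanded intervals that overlap or touch
ex <- ex[order(ex$start), ]
block <- cumsum(c(TRUE, ex$start[-1] > cummax(ex$end)[-nrow(ex)]))
merged <- data.frame(
  start = as.vector(tapply(ex$start, block, min)),
  end   = as.vector(tapply(ex$end, block, max))
)

# 3. contract by gap / 2 again and convert back to Date
out <- data.frame(
  date1 = as.Date(merged$start + gap / 2, origin = "1970-01-01"),
  date2 = as.Date(merged$end - gap / 2, origin = "1970-01-01")
)
out
```

The two original intervals are 12 days apart, so the expanded intervals overlap and `out` should contain a single row from 2016-03-01 to 2020-05-28.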

library( data.table )
library( intervals )
#set maximum gap between intervals
gap = 60
#set data to data.table format
setDT(df1)
#convert the day-month-year strings to Date, then to numeric (required by the intervals-package)
df1[, c("date1", "date2") := lapply( .SD, function(d) as.numeric( as.Date( d, "%d-%m-%Y" ) ) ), .SDcols = c("date1", "date2") ]
#where the magic happens (see text above for explanation )
ans <- data.table::rbindlist(
  lapply( split( df1 , by = "names", keep.by = FALSE ), function(x) {
    as.data.table(
    intervals::close_intervals( intervals::contract( intervals::reduce( intervals::expand( 
            intervals::Intervals( x, type = "Z" ), 
            gap/2 ) ), gap/2 ) 
      )
    )
  }),
  use.names = TRUE, idcol = "name" )
#use names from df1
setnames( ans, names(ans), names(df1) )
#set numeric back to date
ans[, c("date1", "date2") := lapply( .SD, as.Date, origin = "1970-01-01" ), .SDcols = c("date1", "date2") ]

output

   names      date1      date2
1:  John 2016-03-01 2020-05-28
2:  Rick 2018-01-13 2020-03-02
3: Katie 2019-01-05 2020-04-10
4: Harry 2018-08-27 2019-06-27
5: Harry 2020-02-04 2020-04-08
