
How to find the difference between two dates lying in different rows in R?

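I have a data frame similar to the one below, containing dates from which I need to work out the number of visits. The condition is that, for one unique ID, once the rows are sorted by date, if the difference between the enddt of one row and the strdt of the next row is < 2, the two rows should be counted as one visit.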

Data

 id      strdt         enddt    
 ep01    2017-06-23    2017-06-24  
 ep01    2017-06-28    2017-06-30
 ep01    2017-06-25    2017-06-26
 ep02    2017-05-06    2017-05-10
 ep02    2017-05-12    2017-05-14
 ep02    2017-05-15    2017-05-16  
 ep03    2017-05-15    2017-05-16
 ep04    2017-05-15    2017-05-17 

Expected output:

id     strdt         enddt  
ep01   2017-06-23    2017-06-26
ep01   2017-06-28    2017-06-30
ep02   2017-05-06    2017-05-10
ep02   2017-05-12    2017-05-16 
ep03   2017-05-15    2017-05-16
ep04   2017-05-15    2017-05-17

What I tried

data = read.csv("data.csv",header = T,stringsAsFactors = F)
unique_id = unique(data$id)
id_data = NULL
for (i in 1: length(unique_id)){
id_data = data[data$id == unique_id[i],]  
id_data = id_data[ order(id_data$strdt , decreasing = F ),]
id_data = ifelse(id_data$enddt - id_data$str_dt < 1, id_data$enddt[2,3],id_data$enddt)   
 }

I tried the code above, but I could not get it to work. Thanks in advance.

The dplyr lead function may help with your problem. https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/lead-lag
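As a minimal sketch of that idea, the snippet below computes, within each id, the gap in days between a row's enddt and the next row's strdt. The small `visits` data frame and the column names `next_strdt` and `gap_days` are made up for illustration; the dates are converted with as.Date() because subtracting character strings would fail.

```r
library(dplyr)

# Hypothetical three-row example; as.Date() makes the columns subtractable
visits <- data.frame(
  id    = c("ep01", "ep01", "ep02"),
  strdt = as.Date(c("2017-06-23", "2017-06-25", "2017-05-06")),
  enddt = as.Date(c("2017-06-24", "2017-06-26", "2017-05-10")),
  stringsAsFactors = FALSE
)

gaps <- visits %>%
  arrange(id, strdt) %>%              # sort so that lead() means "next visit"
  group_by(id) %>%                    # lead() must not cross id boundaries
  mutate(next_strdt = lead(strdt),    # NA on the last row of each id
         gap_days   = as.numeric(next_strdt - enddt)) %>%
  ungroup()

print(gaps)
```

The last row of every id gets NA for `gap_days`, so any later comparison such as `gap_days < 2` has to handle NA explicitly.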

I have not built a working solution, but the logic can be inferred from the following code:

library("dplyr")
# name the columns with `=`; `<-` inside data.frame() creates stray global variables instead
dat <- data.frame(id = c("ep01", "ep01", "ep01", "ep02", "ep02", "ep02", "ep03", "ep04"),
                  startdt = as.Date(c("2017-06-23", "2017-06-28", "2017-06-25", "2017-05-06", "2017-05-12", "2017-05-15", "2017-05-15", "2017-05-15")),
                  enddt = as.Date(c("2017-06-24", "2017-06-30", "2017-06-26", "2017-05-10", "2017-05-14", "2017-05-16", "2017-05-16", "2017-05-17")),
                  stringsAsFactors = FALSE
)


# get next start date, you can use dplyr::group_by() to get next start date for each id
dat$start_lead <- lead(dat$startdt)

# calculate difference between next start date and current end date, if diff < 2, then reject otherwise accept
dat$is_less_thn_2 <- ifelse(dat$start_lead - dat$enddt < 2, 0, 1)

# get next diff value
dat$take_enddt_value <- lead(dat$is_less_thn_2)

# This part fails at run time: the lead() columns end in NA, so the comparisons below hit a missing value
for(i in 1:nrow(dat)) {
  # if take_enddt_value is 0, iterate until take_enddt_value is 1, set current enddt value to enddt with take_enddt_value = 1
  if (dat[i, "take_enddt_value"] == 0){
    k = i
    while(dat[k, "take_enddt_value"] == 0){
      k = k + 1
    }
    dat[i, "enddt"] <- dat[k, "enddt"]
  }
}
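One way the loop idea could be made to run is sketched below in base R. It is not the answerer's method: instead of looking ahead with lead(), it sorts the rows, then walks them once and either extends the last kept visit or starts a new one. The function name `merge_visits` and the `max_gap` parameter are hypothetical names chosen for this sketch.

```r
# Sample data from the question, with the column names used in this answer
dat <- data.frame(
  id      = c("ep01", "ep01", "ep01", "ep02", "ep02", "ep02", "ep03", "ep04"),
  startdt = as.Date(c("2017-06-23", "2017-06-28", "2017-06-25", "2017-05-06",
                      "2017-05-12", "2017-05-15", "2017-05-15", "2017-05-15")),
  enddt   = as.Date(c("2017-06-24", "2017-06-30", "2017-06-26", "2017-05-10",
                      "2017-05-14", "2017-05-16", "2017-05-16", "2017-05-17")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: merge rows of the same id whose gap is below max_gap days
merge_visits <- function(dat, max_gap = 2) {
  dat <- dat[order(dat$id, dat$startdt), ]   # sort by id, then by start date
  out <- dat[0, ]                            # empty frame with the same columns
  for (i in seq_len(nrow(dat))) {
    n <- nrow(out)
    # same visit if the id matches and the gap to the last kept visit is small
    same_visit <- n > 0 &&
      out$id[n] == dat$id[i] &&
      as.numeric(dat$startdt[i] - out$enddt[n]) < max_gap
    if (same_visit) {
      out$enddt[n] <- max(out$enddt[n], dat$enddt[i])  # extend the kept visit
    } else {
      out <- rbind(out, dat[i, ])                      # start a new visit
    }
  }
  rownames(out) <- NULL
  out
}

result <- merge_visits(dat)
print(result)
```

Because each row is compared with the last kept visit rather than with a lead() value, there is no trailing NA to trip over. Growing `out` with rbind() is fine for small data but would be slow on large frames.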

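Another approach is to group the rows and then combine each group to compute its start and end dates. Note the flag that is built just before the final group_by statement.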

library(dplyr)
library(data.table)

df %>%
  arrange(id, strdt) %>%
  group_by(id) %>%
  mutate(flag = as.numeric(strdt - lag(enddt, order_by = id, default = first(strdt)))) %>%
  mutate(flag = rleid(ifelse((flag < 2 & row_number() != 1) | lead(flag, order_by = id, default = 9999) < 2, 
                             9999, 
                             row_number()))) %>%  #final grouping happened here
  group_by(id, flag) %>%
  summarise(strdt = first(strdt),
            enddt = last(enddt)) %>%
  select(-flag)

The output is:

  id    strdt      enddt     
1 ep01  2017-06-23 2017-06-26
2 ep01  2017-06-28 2017-06-30
3 ep02  2017-05-06 2017-05-10
4 ep02  2017-05-12 2017-05-16
5 ep03  2017-05-15 2017-05-16
6 ep04  2017-05-15 2017-05-17

Sample data:

df <- structure(list(id = c("ep01", "ep01", "ep01", "ep02", "ep02", 
"ep02", "ep03", "ep04"), strdt = structure(c(17340, 17345, 17342, 
17292, 17298, 17301, 17301, 17301), class = "Date"), enddt = structure(c(17341, 
17347, 17343, 17296, 17300, 17302, 17302, 17303), class = "Date")), .Names = c("id", 
"strdt", "enddt"), row.names = c(NA, -8L), class = "data.frame")
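The same grouping can also be sketched with dplyr alone, without rleid from data.table. This is an alternative under stated assumptions, not the code above: a new visit starts whenever the gap to the previous row's enddt is missing or at least 2 days, and cumsum() over that condition yields a visit number. The column names `gap` and `visit` are made up for the sketch, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(
  id    = c("ep01", "ep01", "ep01", "ep02", "ep02", "ep02", "ep03", "ep04"),
  strdt = as.Date(c("2017-06-23", "2017-06-28", "2017-06-25", "2017-05-06",
                    "2017-05-12", "2017-05-15", "2017-05-15", "2017-05-15")),
  enddt = as.Date(c("2017-06-24", "2017-06-30", "2017-06-26", "2017-05-10",
                    "2017-05-14", "2017-05-16", "2017-05-16", "2017-05-17")),
  stringsAsFactors = FALSE
)

merged <- df %>%
  arrange(id, strdt) %>%
  group_by(id) %>%
  mutate(gap   = as.numeric(strdt - lag(enddt)),    # NA on the first row of each id
         visit = cumsum(is.na(gap) | gap >= 2)) %>% # increments when a new visit starts
  group_by(id, visit) %>%
  summarise(strdt = min(strdt),
            enddt = max(enddt),
            .groups = "drop") %>%
  select(-visit)

print(merged)
```

Each row is compared only with the row immediately before it, so if one visit could lie entirely inside an earlier, longer one, comparing against a running maximum of enddt would be the safer choice.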
