For each ID, find if dates overlap and then create new dates and remove rows
Find overlapping dates for each ID and create a new row for the overlap
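I want to find the overlapping dates for each ID and create a new row for each overlap, combining the characters (char) of the overlapping rows. My data may contain more than 2 overlaps per ID, which requires combinations of more than 2 characters, e.g. ERM.
Data: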
ID date1 date2 char
15 2003-04-05 2003-05-06 E
15 2003-04-20 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2007-02-22 I
17 2005-04-15 2014-05-19 C
17 2007-05-15 2008-02-05 I
17 2008-02-05 2012-02-14 M
17 2010-06-07 2011-02-14 V
17 2010-09-22 2014-05-19 P
17 2012-02-28 2013-03-04 R
Output I want:
ID date1 date2 char
15 2003-04-05 2003-04-20 E
15 2003-04-20 2003-05-06 ER
15 2003-05-06 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2005-04-15 I
17 2005-04-15 2007-02-22 IC
17 2005-04-15 2007-05-15 C
17 2007-05-15 2008-02-05 CI
17 2008-02-05 2012-02-14 CM
17 2010-06-07 2011-02-14 CV
17 2010-09-22 2014-05-19 CP
17 2012-02-28 2013-03-04 CR
17 2014-05-19 2014-05-19 P
17 2010-06-07 2012-02-14 MV
17 2010-09-22 2011-02-14 VP
17 2012-02-28 2013-03-04 RP
What I have tried: I subtracted date2 of the previous row from date1 of the current row using the following line:
df$diff <- c(NA, df[2:nrow(df), "date1"] - df[1:(nrow(df)-1), "date2"])
Then I identified the overlaps between rows:
df$overlap[which(df$diff<1)] <-1
df$overlap.up <- c(df$overlap[2:(nrow(df))], "NA")
df$overlap.final[which(df$overlap==1 | df$overlap.up==1)] <- 1
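I then selected the rows with overlap.final == 1, put them into another data frame, and looked for the overlaps within each ID.
However, I have realized that this is too simplistic and flawed, because it only picks up overlaps that occur sequentially (using the date difference from the first step). What I need to do is take the series of dates for each ID and loop through every combination to determine whether there is an overlap and, if so, record the start and end dates and create a new character "char" that indicates what was combined between those two dates. I think I need a loop to do this.
I tried to create a loop to find the overlapping intervals between date1 and date2: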
df <- df[which(!duplicated(df$ ID)),]
for (i in 1:nrow(df)) {
tmp <- length(which(df $ID[i] & (df$date1[i] >df$date1 & df$date1[i]< df$date2) | (df$date2[i] < df$date2& df$date2[i]> df$date1))) >0
df$int[i]<- tmp
}
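But this does not work.
After identifying the overlapping intervals, I need to create new rows for each new start and end date, with a new char that represents the overlap.
Another version of the loop I tried for identifying overlaps: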
for (i in 1:nrow(df)) {
if (df$ID[i]==IDs$ID){
tmp <- length(df, df$ ID[i]==IDs$ & (df$date1[i]> df$date1 & df$date1 [i]< df$date2 | df$date2[i] < df$date2 & df$date2[i]> df$date1)) >0
df$int[i]<- tmp
}
}
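First, we create a data.table of all possible intervals for each ID. All possible intervals means that we take all start and end dates of an ID and combine them in a sorted vector tmp. The unique values indicate all possible intersections (or breaks) of all given intervals of an ID on the time axis. For later joining, the breaks are rearranged into one interval per row, with start and end columns: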
library(data.table)
options(datatable.print.class = TRUE)
breaks <- DT[, {
tmp <- unique(sort(c(date1, date2)))
.(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = ID]
breaks
       ID      start        end
    <int>     <IDat>     <IDat>
 1:    15 2003-04-05 2003-04-20
 2:    15 2003-04-20 2003-05-06
 3:    15 2003-05-06 2003-06-20
 4:    16 2001-01-02 2002-03-04
 5:    17 2003-03-05 2005-04-15
 6:    17 2005-04-15 2007-02-22
 7:    17 2007-02-22 2007-05-15
 8:    17 2007-05-15 2008-02-05
 9:    17 2008-02-05 2010-06-07
10:    17 2010-06-07 2010-09-22
11:    17 2010-09-22 2011-02-14
12:    17 2011-02-14 2012-02-14
13:    17 2012-02-14 2012-02-28
14:    17 2012-02-28 2013-03-04
15:    17 2013-03-04 2014-05-19
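To make the head()/tail() pairing concrete, here is a small base R sketch that uses only the two rows of ID 15. The names pts and intervals are made up for illustration and do not appear in the answer's code.

```r
# Start and end dates of the two rows for ID 15
date1 <- as.Date(c("2003-04-05", "2003-04-20"))
date2 <- as.Date(c("2003-05-06", "2003-06-20"))

# All distinct breakpoints on the time axis, in ascending order
pts <- unique(sort(c(date1, date2)))

# Pair each breakpoint with its successor: n breakpoints give n - 1 intervals
intervals <- data.frame(start = head(pts, -1L), end = tail(pts, -1L))
print(intervals)
```

Four distinct dates give three consecutive intervals, which correspond to rows 1 to 3 of the breaks table.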
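Then, a non-equi join is performed, which aggregates the values on the join conditions at the same time (by = .EACHI is called grouping by each i; see this answer for a more detailed explanation):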
DT[breaks, on = .(ID, date1 <= start, date2 >= end), paste(char, collapse = ""),
by = .EACHI, allow.cartesian = TRUE]
       ID      date1      date2     V1
    <int>     <IDat>     <IDat> <char>
 1:    15 2003-04-05 2003-04-20      E
 2:    15 2003-04-20 2003-05-06     ER
 3:    15 2003-05-06 2003-06-20      R
 4:    16 2001-01-02 2002-03-04      M
 5:    17 2003-03-05 2005-04-15      I
 6:    17 2005-04-15 2007-02-22     IC
 7:    17 2007-02-22 2007-05-15      C
 8:    17 2007-05-15 2008-02-05     CI
 9:    17 2008-02-05 2010-06-07     CM
10:    17 2010-06-07 2010-09-22    CMV
11:    17 2010-09-22 2011-02-14   CMVP
12:    17 2011-02-14 2012-02-14    CMP
13:    17 2012-02-14 2012-02-28     CP
14:    17 2012-02-28 2013-03-04    CPR
15:    17 2013-03-04 2014-05-19     CP
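The result differs from the expected result posted by the OP, but plotting the data shows that the result above covers all possible overlaps: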
library(ggplot2)
ggplot(DT) + aes(y = char, yend = char, x = date1, xend = date2) +
geom_segment() + facet_wrap("ID", ncol = 1L)
library(data.table)
DT <- fread(
"ID date1 date2 char
15 2003-04-05 2003-05-06 E
15 2003-04-20 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2007-02-22 I
17 2005-04-15 2014-05-19 C
17 2007-05-15 2008-02-05 I
17 2008-02-05 2012-02-14 M
17 2010-06-07 2011-02-14 V
17 2010-09-22 2014-05-19 P
17 2012-02-28 2013-03-04 R"
)
cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
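For readers without data.table, the same two-step idea (build the elementary intervals per ID, then collect the chars of every row that covers each interval) can be sketched in base R. This is an illustration under assumptions, not the answer's own method: split_overlaps() is a hypothetical helper, and it does not treat gaps that no row covers in any special way (such an interval would simply get an empty string).

```r
# Sample data from the question
df <- data.frame(
  ID    = c(15, 15, 16, 17, 17, 17, 17, 17, 17, 17),
  date1 = as.Date(c("2003-04-05", "2003-04-20", "2001-01-02", "2003-03-05",
                    "2005-04-15", "2007-05-15", "2008-02-05", "2010-06-07",
                    "2010-09-22", "2012-02-28")),
  date2 = as.Date(c("2003-05-06", "2003-06-20", "2002-03-04", "2007-02-22",
                    "2014-05-19", "2008-02-05", "2012-02-14", "2011-02-14",
                    "2014-05-19", "2013-03-04")),
  char  = c("E", "R", "M", "I", "C", "I", "M", "V", "P", "R"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split the rows of one ID into elementary intervals
split_overlaps <- function(d) {
  pts   <- unique(sort(c(d$date1, d$date2)))
  start <- head(pts, -1L)
  end   <- tail(pts, -1L)
  # For each elementary interval, paste the chars of all rows that cover it
  chars <- vapply(seq_along(start), function(k) {
    paste(d$char[d$date1 <= start[k] & d$date2 >= end[k]], collapse = "")
  }, character(1))
  data.frame(ID = rep(d$ID[1], length(start)), date1 = start, date2 = end,
             char = chars, stringsAsFactors = FALSE)
}

# Apply the helper to each ID and stack the results
res <- do.call(rbind, lapply(split(df, df$ID), split_overlaps))
rownames(res) <- NULL
print(res)
```

The coverage test inside vapply() mirrors the join condition date1 <= start, date2 >= end used in the data.table version.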
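The for-loop you added to the question and the comparisons it contains are a good start. The date comparisons need some additional parentheses ( and ). This for-loop approach automatically takes new rows in the data frame into account. Therefore, you can get strings of three, four and more characters in the char column.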
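On the point about parentheses: in R, & binds more tightly than |, so a mixed condition without explicit grouping may not be evaluated the way it reads. A minimal illustration with made-up logical values (same_id, cond_a and cond_b are placeholders, not columns of the data):

```r
same_id <- FALSE  # the two rows belong to different IDs
cond_a  <- FALSE  # first date condition
cond_b  <- TRUE   # second date condition

# Parsed as (same_id & cond_a) | cond_b, so the ID check does not guard cond_b
without_parens <- same_id & cond_a | cond_b

# Explicit grouping: the ID check applies to both date conditions
with_parens <- same_id & (cond_a | cond_b)

print(c(without_parens = without_parens, with_parens = with_parens))
```

Without the parentheses, rows from a different ID can slip through whenever the last date condition happens to be true.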
df = as.data.frame(list('ID'=c(15, 15, 16, 17, 17, 17, 17, 17, 17, 17),
'date1'=as.Date(c('2003-04-05', '2003-04-20', '2001-01-02', '2003-03-05', '2005-04-15', '2007-05-15', '2008-02-05', '2010-06-07', '2010-09-22', '2012-02-28')),
'date2'=as.Date(c('2003-05-06', '2003-06-20', '2002-03-04', '2007-02-22', '2014-05-19', '2008-02-05', '2012-02-14', '2011-02-14', '2014-05-19', '2013-03-04')),
'char'=c('E', 'R', 'M', 'I', 'C', 'I', 'M', 'V', 'P', 'R')),
stringsAsFactors=FALSE)
Iterate over all rows (those present in the original data.frame) and compare each of them with all rows currently in the data frame.
nrow_init = nrow(df)
for (i in 1:(nrow(df)-1)) {
print(i)
## get rows of df that have overlapping dates
## (1:nrow(df))>i :: consider only rows below the current row to avoid double processing of two row-pairs
## (!grepl(df$char[i],df$char)) :: prevent double letters
## Because we call nrow(df) each time (and not save it as a variable once in the beginning), we consider also new rows here. Therefore, we do not need the specific procedure for comparing 3 or more rows.
loc = ((1:nrow(df))>i) & (!grepl(df$char[i],df$char)) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date1>df$date1[i]) & (df$date1<df$date2[i])) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)) | ((df$date2<df$date2[i]) & (df$date2>df$date1[i])))
## Uncomment this line, if you want to compare only two rows each and not more
# loc = ((1:nrow(df))<=nrow_init) & ((1:nrow(df))>i) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)))
## proceed only if at least one overlapping row was found
if (sum(loc) > 0) {
# build new rows
# pmax and pmin compute the element-wise maximum and minimum; df$date1[i] and df$date2[i] are automatically recycled to the length of df$date1[loc] and df$date2[loc], respectively
df_append = as.data.frame(list('ID'=df$ID[loc],
'date1'=pmax(df$date1[i],df$date1[loc]),
'date2'=pmin(df$date2[i],df$date2[loc]),
'char'=paste0(df$char[i],df$char[loc])))
## append new rows
df = rbind(df, df_append)
}
}
## create a new column and sort the characters in it
## idea for sort: https://stackoverflow.com/a/5904854/4612235
df$sort_char = df$char
for (i in 1:nrow(df)) df$sort_char[i] = paste(sort(unlist(strsplit(df$sort_char[i], ""))), collapse = "")
## remove duplicates
df = df[!duplicated(df[c('ID', 'date1', 'date2', 'sort_char')]),]
## remove additional column
df$sort_char = NULL
ID date1 date2 char
15 2003-04-05 2003-05-06 E
15 2003-04-20 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2007-02-22 I
17 2005-04-15 2014-05-19 C
17 2007-05-15 2008-02-05 I
17 2008-02-05 2012-02-14 M
17 2010-06-07 2011-02-14 V
17 2010-09-22 2014-05-19 P
17 2012-02-28 2013-03-04 R
15 2003-04-20 2003-05-06 ER
17 2005-04-15 2007-02-22 IC
17 2007-05-15 2008-02-05 CI
17 2008-02-05 2012-02-14 CM
17 2010-06-07 2011-02-14 CV
17 2010-09-22 2014-05-19 CP
17 2012-02-28 2013-03-04 CR
17 2010-06-07 2011-02-14 MV
17 2010-09-22 2012-02-14 MP
17 2010-06-07 2011-02-14 MCV
17 2010-09-22 2012-02-14 MCP
17 2010-09-22 2011-02-14 VP
17 2010-09-22 2011-02-14 VCP
17 2010-09-22 2011-02-14 VMP
17 2010-09-22 2011-02-14 VMCP
17 2012-02-28 2013-03-04 PR
17 2012-02-28 2013-03-04 PCR
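As a side note, the four-part date comparison in loc can be written more compactly: two intervals of positive length overlap when each one starts before the other ends. The sketch below uses a hypothetical helper overlaps(). Unlike the four-part condition, it also flags two identical intervals, so treat it as an alternative under that assumption and not as a drop-in replacement.

```r
# Hypothetical helper: TRUE where interval (s1, e1) overlaps interval (s2, e2);
# intervals that merely touch at an endpoint do not count as overlapping
overlaps <- function(s1, e1, s2, e2) (s1 < e2) & (s2 < e1)

# I (2003-03-05 to 2007-02-22) against C (2005-04-15 to 2014-05-19)
r_overlap <- overlaps(as.Date("2003-03-05"), as.Date("2007-02-22"),
                      as.Date("2005-04-15"), as.Date("2014-05-19"))

# I (2007-05-15 to 2008-02-05) against M (2008-02-05 to 2012-02-14)
r_touch <- overlaps(as.Date("2007-05-15"), as.Date("2008-02-05"),
                    as.Date("2008-02-05"), as.Date("2012-02-14"))

print(c(overlap = r_overlap, touch = r_touch))
```

Because the comparison is vectorised, the scalar arguments could also be df$date1[i] and df$date2[i] tested against the full df$date1 and df$date2 columns.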