![](/img/trans.png)
[英]Merge two datasets based on certain conditions, while maintaining specific columns
[英]R merge two datasets based on specific columns with added condition
Uwe 和 GKi 的答案都是正確的。 Gki 收到賞金是因為 Uwe 遲到了,但 Uwe 的解決方案運行速度大約是 15 倍
我有兩個數據集,其中包含不同患者在多個測量時刻的分數,如下所示:
df1 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient3"),
"Days" = c(0,25,235,353,100,538),
"Score" = c(NA,2,3,4,5,6),
stringsAsFactors = FALSE)
df2 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient2","patient3"),
"Days" = c(0,25,248,353,100,150,503),
"Score" = c(1,10,3,4,5,7,6),
stringsAsFactors = FALSE)
> df1
ID Days Score
1 patient1 0 NA
2 patient1 25 2
3 patient1 235 3
4 patient1 353 4
5 patient2 100 5
6 patient3 538 6
> df2
ID Days Score
1 patient1 0 1
2 patient1 25 10
3 patient1 248 3
4 patient1 353 4
5 patient2 100 5
6 patient2 150 7
7 patient3 503 6
ID
列顯示患者 ID, Days
列顯示測量時刻(患者納入后的天數), Score
列顯示測量得分。 兩個數據集都顯示相同的數據,但時間不同(df1 是 2 年前,df2 具有相同的數據,但從今年開始更新)。
我必須比較兩個數據集之間每個患者和每個時刻的分數。 但是,在某些情況下, Days
變量會隨着時間的推移而發生細微的變化,因此通過簡單的連接來比較數據集是行不通的。 例子:
library(dplyr)
> full_join(df1, df2, by=c("ID","Days")) %>%
+ arrange(.[[1]], as.numeric(.[[2]]))
ID Days Score.x Score.y
1 patient1 0 NA 1
2 patient1 25 2 10
3 patient1 235 3 NA
4 patient1 248 NA 3
5 patient1 353 4 4
6 patient2 100 5 5
7 patient2 150 NA 7
8 patient3 503 NA 6
9 patient3 538 6 NA
此處,第 3 行和第 4 行包含相同測量的數據(得分為 3)但未連接,因為Days
列的值不同(235 與 248)。
問題:我正在尋找一種在第二列(比如 30 天)上設置閾值的方法,這將導致以下 output:
> threshold <- 30
> *** insert join code ***
ID Days Score.x Score.y
1 patient1 0 NA 1
2 patient1 25 2 10
3 patient1 248 3 3
4 patient1 353 4 4
5 patient2 100 5 5
6 patient2 150 NA 7
7 patient3 503 NA 6
8 patient3 538 6 NA
此 output 顯示先前 output 的第 3 行和第 4 行已被合並(因為 248-235 < 30)並已采用第二個 df 的Days
(248)的值。
要記住的三個主要條件是:
Days
變量值,因此不應合並。 可能這些值之一確實存在於另一個 dataframe 的閾值內,並且必須合並這些值。 請參閱下面示例中的第 3 行。> df1
ID Days Score
1 patient1 0 1
2 patient1 5 2
3 patient1 10 3
4 patient1 15 4
5 patient1 50 5
> df2
ID Days Score
1 patient1 0 1
2 patient1 5 2
3 patient1 12 3
4 patient1 15 4
5 patient1 50 5
> df_combined
ID Days Score.x Score.y
1 patient1 0 1 1
2 patient1 5 2 2
3 patient1 12 3 3
4 patient1 15 4 4
5 patient1 50 5 5
為 CHINSOON12 編輯
> df1
ID Days Score
1: patient1 0 1
2: patient1 116 2
3: patient1 225 3
4: patient1 309 4
5: patient1 351 5
6: patient2 0 6
7: patient2 49 7
> df2
ID Days Score
1: patient1 0 11
2: patient1 86 12
3: patient1 195 13
4: patient1 279 14
5: patient1 315 15
6: patient2 0 16
7: patient2 91 17
8: patient2 117 18
我將您的解決方案包裝在 function 中,如下所示:
testSO2 <- function(DT1,DT2) {
setDT(DT1);setDT(DT2)
names(DT1) <- c("ID","Days","X")
names(DT2) <- c("ID","Days","Y")
DT1$Days <- as.numeric(DT1$Days)
DT2$Days <- as.numeric(DT2$Days)
DT1[, c("s1", "e1", "s2", "e2") := .(Days - 30L, Days + 30L, Days, Days)]
DT2[, c("s1", "e1", "s2", "e2") := .(Days, Days, Days - 30L, Days + 30L)]
byk <- c("ID", "s1", "e1")
setkeyv(DT1, byk)
setkeyv(DT2, byk)
o1 <- foverlaps(DT1, DT2)
byk <- c("ID", "s2", "e2")
setkeyv(DT1, byk)
setkeyv(DT2, byk)
o2 <- foverlaps(DT2, DT1)
olaps <- funion(o1, setcolorder(o2, names(o1)))[
is.na(Days), Days := i.Days]
outcome <- olaps[, {
if (all(!is.na(Days)) && any(Days == i.Days)) {
s <- .SD[Days == i.Days, .(Days = Days[1L],
X = X[1L],
Y = Y[1L])]
} else {
s <- .SD[, .(Days = max(Days, i.Days), X, Y)]
}
unique(s)
},
keyby = .(ID, md = pmax(Days, i.Days))][, md := NULL][]
return(outcome)
}
結果是:
> testSO2(df1,df2)
ID Days X Y
1: patient1 0 1 11
2: patient1 116 2 12
3: patient1 225 3 13
4: patient1 309 4 14
5: patient1 315 4 15
6: patient1 351 5 NA
7: patient2 0 6 16
8: patient2 49 7 NA
9: patient2 91 NA 17
10: patient2 117 NA 18
如您所見,第 4 行和第 5 行是錯誤的。 df1 中的Score
值使用了兩次 (4)。 這些行周圍的正確 output 應如下所示,因為每個分數(在本例中為 X 或 Y)只能使用一次:
ID Days X Y
4: patient1 309 4 14
5: patient1 315 NA 15
6: patient1 351 5 NA
下面的數據框代碼。
> dput(df1)
structure(list(ID = c("patient1", "patient1", "patient1", "patient1",
"patient1", "patient2", "patient2"), Days = c("0", "116", "225",
"309", "351", "0", "49"), Score = 1:7), row.names = c(NA, 7L), class = "data.frame")
> dput(df2)
structure(list(ID = c("patient1", "patient1", "patient1", "patient1",
"patient1", "patient2", "patient2", "patient2"), Days = c("0",
"86", "195", "279", "315", "0", "91", "117"), Score = 11:18), row.names = c(NA,
8L), class = "data.frame")
聽起來像是對現實但混亂的數據集的數據清理練習,不幸的是,我們大多數人以前都有過這種經驗。 這是另一個data.table
選項:
DT1[, c("Xrn", "s1", "e1", "s2", "e2") := .(.I, Days - 30L, Days + 30L, Days, Days)]
DT2[, c("Yrn", "s1", "e1", "s2", "e2") := .(.I, Days, Days, Days - 30L, Days + 30L)]
byk <- c("ID", "s1", "e1")
setkeyv(DT1, byk)
setkeyv(DT2, byk)
o1 <- foverlaps(DT1, DT2)
byk <- c("ID", "s2", "e2")
setkeyv(DT1, byk)
setkeyv(DT2, byk)
o2 <- foverlaps(DT2, DT1)
olaps <- funion(o1, setcolorder(o2, names(o1)))[
is.na(Days), Days := i.Days]
ans <- olaps[, {
if (any(Days == i.Days)) {
.SD[Days == i.Days,
.(Days=Days[1L], Xrn=Xrn[1L], Yrn=Yrn[1L], X=X[1L], Y=Y[1L])]
} else {
.SD[, .(Days=md, Xrn=Xrn[1L], Yrn=Yrn[1L], X=X[1L], Y=Y[1L])]
}
},
keyby = .(ID, md = pmax(Days, i.Days))]
#or also ans[duplicated(Xrn), X := NA_integer_][duplicated(Yrn), Y := NA_integer_]
ans[rowid(Xrn) > 1L, X := NA_integer_]
ans[rowid(Yrn) > 1L, Y := NA_integer_]
ans[, c("md", "Xrn", "Yrn") := NULL][]
output 用於以下數據集:
ID Days X Y
1: 1 0 1 11
2: 1 10 2 12
3: 1 25 3 13
4: 1 248 4 14
5: 1 353 5 15
6: 2 100 6 16
7: 2 150 NA 17
8: 3 503 NA 18
9: 3 538 7 NA
output 用於 OP 編輯中的第二個數據集:
ID Days X Y
1: patient1 0 1 11
2: patient1 116 2 12
3: patient1 225 3 13
4: patient1 309 4 14
5: patient1 315 NA 15
6: patient1 351 5 NA
7: patient2 0 6 16
8: patient2 49 7 NA
9: patient2 91 NA 17
10: patient2 117 NA 18
數據(我從其他鏈接的帖子中添加了更多數據,並簡化了數據以便於查看):
library(data.table)
DT1 <- data.table(ID = c(1,1,1,1,1,2,3),
Days = c(0,10,25,235,353,100,538))[, X := .I]
DT2 <- data.table(ID = c(1,1,1,1,1,2,2,3),
Days = c(0,10,25,248,353,100,150,503))[, Y := .I + 10L]
解釋:
依次使用每個表作為左表執行 2 次重疊連接。
將右表中設置 NA 天之前的 2 個結果與左表中的結果合並。
按患者和重疊日期分組。 如果存在相同的日期,則保留記錄。 否則使用最大日期。
每個分數只能使用一次,因此刪除重復項。
如果您發現這種方法沒有給出正確結果的情況,請告訴我。
使用lapply
查找天數差異低於閾值的基本解決方案並制作expand.grid
以獲得所有可能的組合。 然后刪除那些會選擇兩次或在另一個后面選擇的那些。 從這些計算日差並選擇具有連續最低差的行。 然后rbind
來自 df2 的不匹配。
threshold <- 30
nmScore <- threshold
x <- do.call(rbind, lapply(unique(c(df1$ID, df2$ID)), function(ID) {
x <- df1[df1$ID == ID,]
y <- df2[df2$ID == ID,]
if(nrow(x) == 0) {return(data.frame(ID=ID, y[1,-1][NA,], y[,-1]))}
if(nrow(y) == 0) {return(data.frame(ID=ID, x[,-1], x[1,-1][NA,]))}
x <- x[order(x$Days),]
y <- y[order(y$Days),]
z <- do.call(expand.grid, lapply(x$Days, function(z) c(NA,
which(abs(z - y$Days) < threshold))))
z <- z[!apply(z, 1, function(z) {anyDuplicated(z[!is.na(z)]) > 0 ||
any(diff(z[!is.na(z)]) < 1)}), , drop = FALSE]
s <- as.data.frame(sapply(seq_len(ncol(z)), function(j) {
abs(x$Days[j] - y$Days[z[,j]])}))
s[is.na(s)] <- nmScore
s <- matrix(apply(s, 1, sort), nrow(s), byrow = TRUE)
i <- rep(TRUE, nrow(s))
for(j in seq_len(ncol(s))) {i[i] <- s[i,j] == min(s[i,j])}
i <- unlist(z[which.max(i),])
j <- setdiff(seq_len(nrow(y)), i)
rbind(data.frame(ID=ID, x[,-1], y[i, -1]),
if(length(j) > 0) data.frame(ID=ID, x[1,-1][NA,], y[j, -1], row.names=NULL))
}))
x <- x[order(x[,1], ifelse(is.na(x[,2]), x[,4], x[,2])),]
數據:
0..Boris Ruwe 的第一個測試用例,Boris Ruwe 的第 1..2 個測試用例,Boris Ruwe 的第 2..3 個測試用例,3..Uwe 的測試用例,4..Boris Ruwe 的測試用例 R 滾動連接兩個 data.tables 在 join 上有誤差,5..來自 GKi 的測試用例。
df1 <- structure(list(ID = c("0patient1", "0patient1", "0patient1",
"0patient1", "0patient2", "0patient3", "1patient1", "1patient1",
"1patient1", "1patient1", "1patient1", "2patient1", "2patient1",
"2patient1", "2patient1", "2patient1", "2patient2", "2patient2",
"3patient1", "3patient1", "3patient1", "3patient1", "3patient1",
"3patient1", "3patient2", "3patient3", "4patient1", "4patient1",
"4patient1", "4patient1", "4patient2", "4patient3", "5patient1",
"5patient1", "5patient1", "5patient2"), Days = c(0, 25, 235,
353, 100, 538, 0, 5, 10, 15, 50, 0, 116, 225, 309, 351, 0, 49,
0, 1, 25, 235, 237, 353, 100, 538, 0, 10, 25, 340, 100, 538,
3, 6, 10, 1), Score = c(NA, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 1,
2, 3, 4, 5, 6, 7, NA, 2, 3, 4, 5, 6, 7, 8, NA, 2, 3, 99, 5, 6,
1, 2, 3, 1)), row.names = c(NA, -36L), class = "data.frame")
df2 <- structure(list(ID = c("0patient1", "0patient1", "0patient1",
"0patient1", "0patient2", "0patient2", "0patient3", "1patient1",
"1patient1", "1patient1", "1patient1", "1patient1", "2patient1",
"2patient1", "2patient1", "2patient1", "2patient1", "2patient2",
"2patient2", "2patient2", "3patient1", "3patient1", "3patient1",
"3patient1", "3patient1", "3patient1", "3patient2", "3patient2",
"3patient3", "4patient1", "4patient1", "4patient1", "4patient1",
"4patient2", "4patient2", "4patient3", "5patient1", "5patient1",
"5patient1", "5patient3"), Days = c(0, 25, 248, 353, 100, 150,
503, 0, 5, 12, 15, 50, 0, 86, 195, 279, 315, 0, 91, 117, 0, 25,
233, 234, 248, 353, 100, 150, 503, 0, 10, 25, 353, 100, 150,
503, 1, 4, 8, 1), Score = c(1, 10, 3, 4, 5, 7, 6, 1, 2, 3, 4,
5, 11, 12, 13, 14, 15, 16, 17, 18, 11, 12, 13, 14, 15, 16, 17,
18, 19, 1, 10, 3, 4, 5, 7, 6, 11, 12, 13, 1)), row.names = c(NA,
-40L), class = "data.frame")
df1
# ID Days Score
#1 0patient1 0 NA
#2 0patient1 25 2
#3 0patient1 235 3
#4 0patient1 353 4
#5 0patient2 100 5
#6 0patient3 538 6
#7 1patient1 0 1
#8 1patient1 5 2
#9 1patient1 10 3
#10 1patient1 15 4
#11 1patient1 50 5
#12 2patient1 0 1
#13 2patient1 116 2
#14 2patient1 225 3
#15 2patient1 309 4
#16 2patient1 351 5
#17 2patient2 0 6
#18 2patient2 49 7
#19 3patient1 0 NA
#20 3patient1 1 2
#21 3patient1 25 3
#22 3patient1 235 4
#23 3patient1 237 5
#24 3patient1 353 6
#25 3patient2 100 7
#26 3patient3 538 8
#27 4patient1 0 NA
#28 4patient1 10 2
#29 4patient1 25 3
#30 4patient1 340 99
#31 4patient2 100 5
#32 4patient3 538 6
#33 5patient1 3 1
#34 5patient1 6 2
#35 5patient1 10 3
#36 5patient2 1 1
df2
# ID Days Score
#1 0patient1 0 1
#2 0patient1 25 10
#3 0patient1 248 3
#4 0patient1 353 4
#5 0patient2 100 5
#6 0patient2 150 7
#7 0patient3 503 6
#8 1patient1 0 1
#9 1patient1 5 2
#10 1patient1 12 3
#11 1patient1 15 4
#12 1patient1 50 5
#13 2patient1 0 11
#14 2patient1 86 12
#15 2patient1 195 13
#16 2patient1 279 14
#17 2patient1 315 15
#18 2patient2 0 16
#19 2patient2 91 17
#20 2patient2 117 18
#21 3patient1 0 11
#22 3patient1 25 12
#23 3patient1 233 13
#24 3patient1 234 14
#25 3patient1 248 15
#26 3patient1 353 16
#27 3patient2 100 17
#28 3patient2 150 18
#29 3patient3 503 19
#30 4patient1 0 1
#31 4patient1 10 10
#32 4patient1 25 3
#33 4patient1 353 4
#34 4patient2 100 5
#35 4patient2 150 7
#36 4patient3 503 6
#37 5patient1 1 11
#38 5patient1 4 12
#39 5patient1 8 13
#40 5patient3 1 1
結果:
# ID Days Score Days.1 Score.1
#1 0patient1 0 NA 0 1
#2 0patient1 25 2 25 10
#3 0patient1 235 3 248 3
#4 0patient1 353 4 353 4
#5 0patient2 100 5 100 5
#110 0patient2 NA NA 150 7
#111 0patient3 NA NA 503 6
#6 0patient3 538 6 NA NA
#7 1patient1 0 1 0 1
#8 1patient1 5 2 5 2
#9 1patient1 10 3 12 3
#10 1patient1 15 4 15 4
#11 1patient1 50 5 50 5
#12 2patient1 0 1 0 11
#112 2patient1 NA NA 86 12
#13 2patient1 116 2 NA NA
#210 2patient1 NA NA 195 13
#14 2patient1 225 3 NA NA
#37 2patient1 NA NA 279 14
#15 2patient1 309 4 315 15
#16 2patient1 351 5 NA NA
#17 2patient2 0 6 0 16
#18 2patient2 49 7 NA NA
#113 2patient2 NA NA 91 17
#211 2patient2 NA NA 117 18
#19 3patient1 0 NA 0 11
#20 3patient1 1 2 NA NA
#21 3patient1 25 3 25 12
#114 3patient1 NA NA 233 13
#22 3patient1 235 4 234 14
#23 3patient1 237 5 248 15
#24 3patient1 353 6 353 16
#25 3patient2 100 7 100 17
#115 3patient2 NA NA 150 18
#116 3patient3 NA NA 503 19
#26 3patient3 538 8 NA NA
#27 4patient1 0 NA 0 1
#28 4patient1 10 2 10 10
#29 4patient1 25 3 25 3
#30 4patient1 340 99 353 4
#31 4patient2 100 5 100 5
#117 4patient2 NA NA 150 7
#118 4patient3 NA NA 503 6
#32 4patient3 538 6 NA NA
#119 5patient1 NA NA 1 11
#33 5patient1 3 1 4 12
#34 5patient1 6 2 8 13
#35 5patient1 10 3 NA NA
#36 5patient2 1 1 NA NA
#NA 5patient3 NA NA 1 1
格式化結果:
data.frame(ID=x[,1], Days=ifelse(is.na(x[,2]), x[,4], x[,2]),
Score.x=x[,3], Score.y=x[,5])
# ID Days Score.x Score.y
#1 0patient1 0 NA 1
#2 0patient1 25 2 10
#3 0patient1 235 3 3
#4 0patient1 353 4 4
#5 0patient2 100 5 5
#6 0patient2 150 NA 7
#7 0patient3 503 NA 6
#8 0patient3 538 6 NA
#9 1patient1 0 1 1
#10 1patient1 5 2 2
#11 1patient1 10 3 3
#12 1patient1 15 4 4
#13 1patient1 50 5 5
#14 2patient1 0 1 11
#15 2patient1 86 NA 12
#16 2patient1 116 2 NA
#17 2patient1 195 NA 13
#18 2patient1 225 3 NA
#19 2patient1 279 NA 14
#20 2patient1 309 4 15
#21 2patient1 351 5 NA
#22 2patient2 0 6 16
#23 2patient2 49 7 NA
#24 2patient2 91 NA 17
#25 2patient2 117 NA 18
#26 3patient1 0 NA 11
#27 3patient1 1 2 NA
#28 3patient1 25 3 12
#29 3patient1 233 NA 13
#30 3patient1 235 4 14
#31 3patient1 237 5 15
#32 3patient1 353 6 16
#33 3patient2 100 7 17
#34 3patient2 150 NA 18
#35 3patient3 503 NA 19
#36 3patient3 538 8 NA
#37 4patient1 0 NA 1
#38 4patient1 10 2 10
#39 4patient1 25 3 3
#40 4patient1 340 99 4
#41 4patient2 100 5 5
#42 4patient2 150 NA 7
#43 4patient3 503 NA 6
#44 4patient3 538 6 NA
#45 5patient1 1 NA 11
#46 5patient1 3 1 12
#47 5patient1 6 2 13
#48 5patient1 10 3 NA
#49 5patient2 1 1 NA
#50 5patient3 1 NA 1
獲得Days
的替代方案:
#From df1 and in case it is NA I took it from df2
data.frame(ID=x[,1], Days=ifelse(is.na(x[,2]), x[,4], x[,2]),
Score.x=x[,3], Score.y=x[,5])
#From df2 and in case it is NA I took it from df1
data.frame(ID=x[,1], Days=ifelse(is.na(x[,4]), x[,2], x[,4]),
Score.x=x[,3], Score.y=x[,5])
#Mean
data.frame(ID=x[,1], Days=rowMeans(x[,c(2,4)], na.rm=TRUE),
Score.x=x[,3], Score.y=x[,5])
如果應盡量減少天數的總差異,允許不采用最近的,一種可能的方法是:
threshold <- 30
nmScore <- threshold
x <- do.call(rbind, lapply(unique(c(df1$ID, df2$ID)), function(ID) {
x <- df1[df1$ID == ID,]
y <- df2[df2$ID == ID,]
x <- x[order(x$Days),]
y <- y[order(y$Days),]
if(nrow(x) == 0) {return(data.frame(ID=ID, y[1,-1][NA,], y[,-1]))}
if(nrow(y) == 0) {return(data.frame(ID=ID, x[,-1], x[1,-1][NA,]))}
z <- do.call(expand.grid, lapply(x$Days, function(z) c(NA,
which(abs(z - y$Days) < threshold))))
z <- z[!apply(z, 1, function(z) {anyDuplicated(z[!is.na(z)]) > 0 ||
any(diff(z[!is.na(z)]) < 1)}), , drop = FALSE]
s <- as.data.frame(sapply(seq_len(ncol(z)), function(j) {
abs(x$Days[j] - y$Days[z[,j]])}))
s[is.na(s)] <- nmScore
i <- unlist(z[which.min(rowSums(s)),])
j <- setdiff(seq_len(nrow(y)), i)
rbind(data.frame(ID=ID, x[,-1], y[i, -1]),
if(length(j) > 0) data.frame(ID=ID, x[1,-1][NA,], y[j, -1], row.names=NULL))
}))
x <- x[order(x[,1], ifelse(is.na(x[,2]), x[,4], x[,2])),]
遲到了,這里有一個解決方案,它使用完全外連接,然后根據 OP 的規則對行進行分組和聚合。
library(data.table)
threshold <- 30
# full outer join
m <- merge(setDT(df1)[, o := 1L], setDT(df2)[, o := 2L],
by = c("ID", "Days"), all = TRUE)
# reorder rows
setorder(m, ID, Days)
# create grouping variable
m[, g := rleid(ID,
cumsum(c(TRUE, diff(Days) > threshold)),
!is.na(o.x) & !is.na(o.y),
cumsum(c(TRUE, diff(fcoalesce(o.x, o.y)) == 0L))
)][, g := rleid(g, (rowid(g) - 1L) %/% 2)][]
# collapse rows where required
m[, .(ID = last(ID), Days = last(Days),
Score.x = last(na.omit(Score.x)),
Score.y = last(na.omit(Score.y)))
, by = g][, g := NULL][]
對於 OP 的第一個測試用例,我們得到
ID Days Score.x Score.y 1: patient1 0 NA 1 2: patient1 25 2 10 3: patient1 248 3 3 4: patient1 353 4 4 5: patient2 100 5 5 6: patient2 150 NA 7 7: patient3 503 NA 6 8: patient3 538 6 NA
正如預期的那樣。
使用 OP 的第二個測試用例
df1 <- data.table(ID = rep("patient1", 5L), Days = c(0, 5, 10, 15, 50), Score = 1:5)
df2 <- data.table(ID = rep("patient1", 5L), Days = c(0, 5, 12, 15, 50), Score = 1:5)
我們得到
ID Days Score.x Score.y 1: patient1 0 1 1 2: patient1 5 2 2 3: patient1 12 3 3 4: patient1 15 4 4 5: patient1 50 5 5
使用 OP 的第三個測試用例(用於討論 chinsoon12 的答案)
df1 <- data.table(ID = paste0("patient", c(rep(1, 5L), 2, 2)),
Days = c(0, 116, 225, 309, 351, 0, 49), Score = 1:7)
df2 <- data.table(ID = paste0("patient", c(rep(1, 5L), 2, 2, 2)),
Days = c(0, 86, 195, 279, 315, 0, 91, 117), Score = 11:18)
我們得到
ID Days Score.x Score.y 1: patient1 0 1 11 2: patient1 116 2 12 3: patient1 225 3 13 4: patient1 309 4 14 5: patient1 315 NA 15 6: patient1 351 5 NA 7: patient2 0 6 16 8: patient2 49 7 NA 9: patient2 91 NA 17 10: patient2 117 NA 18
正如 OP 所期望的那樣(特別是第 5 行)
最后,我自己的測試用例在233和248之間有5個“重疊天”來驗證這個用例是否會被處理
df1 <- data.table(ID = paste0("patient", c(rep(1, 6L), 2, 3)),
Days = c(0,1,25,235,237,353,100,538),
Score = c(NA, 2:8))
df2 <- data.table(ID = paste0("patient", c(rep(1, 6L), 2, 2, 3)),
Days = c(0, 25, 233, 234, 248, 353, 100, 150, 503),
Score = 11:19)
我們得到
ID Days Score.x Score.y 1: patient1 0 NA 11 # exact match 2: patient1 1 2 NA # overlapping, not collapsed 3: patient1 25 3 12 # exact match 4: patient1 233 NA 13 # overlapping, not collapsed 5: patient1 235 4 14 # overlapping, collapsed 6: patient1 248 5 15 # overlapping, collapsed 7: patient1 353 6 16 # exact match 8: patient2 100 7 17 # exact match 9: patient2 150 NA 18 # not overlapping 10: patient3 503 NA 19 # not overlapping 11: patient3 538 8 NA # not overlapping
完全外連接merge(..., all = TRUE)
在相同的 ID 和日期上找到完全匹配,但包括兩個數據集中沒有匹配的所有其他行。
在加入之前,每個數據集都會獲得一個額外的列o
來指示每個Score
的來源。
結果是有序的,因為后續操作取決於正確的行順序。
所以,用我自己的測試用例,我們得到
m <- merge(setDT(df1)[, o := 1L], setDT(df2)[, o := 2L],
by = c("ID", "Days"), all = TRUE)
setorder(m, ID, Days)[]
ID Days Score.x ox Score.y oy 1: patient1 0 NA 1 11 2 2: patient1 1 2 1 NA NA 3: patient1 25 3 1 12 2 4: patient1 233 NA NA 13 2 5: patient1 234 NA NA 14 2 6: patient1 235 4 1 NA NA 7: patient1 237 5 1 NA NA 8: patient1 248 NA NA 15 2 9: patient1 353 6 1 16 2 10: patient2 100 7 1 17 2 11: patient2 150 NA NA 18 2 12: patient3 503 NA NA 19 2 13: patient3 538 8 1 NA NA
現在,使用rleid()
創建一個分組變量:
m[, g := rleid(ID,
cumsum(c(TRUE, diff(Days) > threshold)),
!is.na(o.x) & !is.na(o.y),
cumsum(c(TRUE, diff(fcoalesce(o.x, o.y)) == 0L))
)][, g := rleid(g, (rowid(g) - 1L) %/% 2)][]
當滿足以下條件之一時,組計數器會提前:
ID
改變ID
內,當連續Days
之間的間隔超過 30 天時(因此一個 ID 內間隔為 30 天或更短的行屬於一個組或“重疊”)1, 2, 1, 2, ...
或2, 1, 2, 1, ...
df1
,后一行來自df2
,或者一行來自df2
,后一行來自df1
。OP沒有明確說明最后一個條件,但這是我對
每個分數/天數/患者組合只能使用一次。 如果合並滿足所有條件但仍有可能進行雙重合並,則應使用第一個。
它確保最多折疊兩行,每行來自不同的數據集。
分組后我們得到
ID Days Score.x ox Score.y oy g 1: patient1 0 NA 1 11 2 1 2: patient1 1 2 1 NA NA 2 3: patient1 25 3 1 12 2 3 4: patient1 233 NA NA 13 2 4 5: patient1 234 NA NA 14 2 5 6: patient1 235 4 1 NA NA 5 7: patient1 237 5 1 NA NA 6 8: patient1 248 NA NA 15 2 6 9: patient1 353 6 1 16 2 7 10: patient2 100 7 1 17 2 8 11: patient2 150 NA NA 18 2 9 12: patient3 503 NA NA 19 2 10 13: patient3 538 8 1 NA NA 11
大多數組僅包含一行,少數包含 2 行,這些行在最后一步中折疊(按組聚合,返回所需的列並刪除分組變量g
)。
按組聚合要求對於每個組,每列僅返回一個值(長度為 1 的向量)。 (否則,組結果將由多行組成。)為簡單起見,上面的實現在所有 4 列上都使用了last()
。
last(Days)
等價於max(Days)
因為數據集是有序的。
但是,如果我理解正確,OP 更喜歡從df2
返回Days
值(盡管 OP 已經提到max(Days)
也是可以接受的)。
為了從df2
返回Days
值,需要修改聚合步驟:如果組大小.N
大於 1,我們從源自df2
的行中選擇Days
值,即oy == 2
。
# collapse rows where required
m[, .(ID = last(ID),
Days = last(if (.N > 1) Days[which(o.y == 2)] else Days),
Score.x = last(na.omit(Score.x)),
Score.y = last(na.omit(Score.y)))
, by = g][, g := NULL][]
這將返回
ID Days Score.x Score.y 1: patient1 0 NA 11 2: patient1 1 2 NA 3: patient1 25 3 12 4: patient1 233 NA 13 5: patient1 234 4 14 6: patient1 248 5 15 7: patient1 353 6 16 8: patient2 100 7 17 9: patient2 150 NA 18 10: patient3 503 NA 19 11: patient3 538 8 NA
現在已從df2
中選擇折疊的第 5 行中的Days
值 234 。
對於Score
列, last()
的使用根本不重要,因為在一組 2 行中應該只有一個非 NA 值。 因此, na.omit()
應該只返回一個值,而last()
可能只是為了保持一致性。
此代碼允許您給出一個閾值,然后將 df1 中的分數合並到 df1 作為新列。 它只會添加落在 df2 +/- 閾值的單個分數范圍內的分數。 請注意,不可能將所有分數都連接起來,因為沒有閾值可以讓所有分數唯一匹配。
threshold <- 40
WhereDF1inDF2 <- apply(sapply(lapply(df2$Days, function(x) (x+threshold):(x-threshold)), function(y) df1$Days %in% y),1,which)
useable <- sapply(WhereDF1inDF2, function(x) length(x) ==1 )
df2$Score1 <- NA
df2$Score1[unlist(WhereDF1inDF2[useable])] <- df1$Score[useable]
> df2
ID Days Score Score1
1 patient1 0 1 NA
2 patient1 25 10 NA
3 patient1 248 3 3
4 patient1 353 4 4
5 patient2 100 5 5
6 patient2 150 7 NA
7 patient3 503 6 6
這是一個可能的data.table
解決方案
library(data.table)
#convert df1 and df2 to data.table format
setDT(df1);setDT(df2)
#set colnames for later on
# (add .df1/.df2 suffix after Days and Score-colnamaes)
cols <- c("Days", "Score")
setnames(df1, cols, paste0( cols, ".df1" ) )
setnames(df2, cols, paste0( cols, ".df2" ) )
#update df1 with new measures from df2 (and df2 with df1)
# copies are made, to prevent changes in df1 and df2
dt1 <- copy(df1)[ df2, `:=`(Days.df2 = i.Days.df2, Score.df2 = i.Score.df2), on = .(ID, Days.df1 = Days.df2), roll = 30]
dt2 <- copy(df2)[ df1, `:=`(Days.df1 = i.Days.df1, Score.df1 = i.Score.df1), on = .(ID, Days.df2 = Days.df1), roll = -30]
#rowbind by columnnames (here the .df1/.df2 suffix is needed!), only keep unique rows
ans <- unique( rbindlist( list( dt1, dt2), use.names = TRUE ) )
#wrangle data to get to desired output
ans[, Days := ifelse( is.na(Days.df2), Days.df1, Days.df2 ) ]
ans <- ans[, .(Days, Score.x = Score.df1, Score.y = Score.df2 ), by = .(ID) ]
setkey( ans, ID, Days ) #for sorting; setorder() can also be used.
# ID Days Score.x Score.y
# 1: patient1 0 NA 1
# 2: patient1 25 2 10
# 3: patient1 248 3 3
# 4: patient1 353 4 4
# 5: patient2 100 5 5
# 6: patient2 150 NA 7
# 7: patient3 503 NA 6
# 8: patient3 538 6 NA
以下代碼適用於您的示例數據。 根據您的條件,它應該適用於您的完整數據。 對於其他例外情況,您可以調整df31
和df32
。
df1 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient3"),
"Days1" = c(0,25,235,353,100,538),
"Score1" = c(NA,2,3,4,5,6),
stringsAsFactors = FALSE)
df2 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient2","patient3"),
"Days2" = c(0,25,248,353,100,150,503),
"Score2" = c(1,10,3,4,5,7,6),
stringsAsFactors = FALSE)
## define a dummy sequence for each patient
df11 <- df1 %>% group_by(ID) %>% mutate(ptseq = row_number())
df21 <- df2 %>% group_by(ID) %>% mutate(ptseq = row_number())
df3 <- dplyr::full_join(df11, df21, by=c("ID","ptseq")) %>%
arrange(.[[1]], as.numeric(.[[2]]))
df31 <- df3 %>% mutate(Days=Days2, diff=Days1-Days2) %>%
mutate(Score1=ifelse(abs(diff)>30, NA, Score1))
df32 <- df3 %>% mutate(diff=Days1-Days2) %>%
mutate(Days = case_when(abs(diff)>30 ~ Days1), Score2=c(NA), Days2=c(NA)) %>%
subset(!is.na(Days))
df <- rbind(df31,df32) %>% select(ID, ptseq, Days, Score1, Score2) %>%
arrange(.[[1]], as.numeric(.[[2]])) %>% select(-2)
>df
ID Days Score1 Score2
<chr> <dbl> <dbl> <dbl>
1 patient1 0 NA 1
2 patient1 25 2 10
3 patient1 248 3 3
4 patient1 353 4 4
5 patient2 100 5 5
6 patient2 150 NA 7
7 patient3 503 NA 6
8 patient3 538 6 NA
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.