簡體   English   中英

R 基於特定列合並兩個數據集並添加條件

[英]R merge two datasets based on specific columns with added condition

Uwe 和 GKi 的答案都是正確的。 Gki 收到賞金是因為 Uwe 遲到了,但 Uwe 的解決方案運行速度大約是 15 倍

我有兩個數據集,其中包含不同患者在多個測量時刻的分數,如下所示:

df1 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient3"),
                  "Days" = c(0,25,235,353,100,538),
                  "Score" = c(NA,2,3,4,5,6), 
                  stringsAsFactors = FALSE)
df2 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient2","patient3"),
                  "Days" = c(0,25,248,353,100,150,503),
                  "Score" = c(1,10,3,4,5,7,6), 
                  stringsAsFactors = FALSE)
> df1
        ID Days Score
1 patient1    0    NA
2 patient1   25     2
3 patient1  235     3
4 patient1  353     4
5 patient2  100     5
6 patient3  538     6

> df2
        ID Days Score
1 patient1    0     1
2 patient1   25    10
3 patient1  248     3
4 patient1  353     4
5 patient2  100     5
6 patient2  150     7
7 patient3  503     6

ID列顯示患者 ID, Days列顯示測量時刻(患者納入后的天數), Score列顯示測量得分。 兩個數據集都顯示相同的數據,但時間不同(df1 是 2 年前,df2 具有相同的數據,但從今年開始更新)。

我必須比較兩個數據集之間每個患者和每個時刻的分數。 但是,在某些情況下, Days變量會隨着時間的推移而發生細微的變化,因此通過簡單的連接來比較數據集是行不通的。 例子:

library(dplyr)

> full_join(df1, df2, by=c("ID","Days")) %>% 
+   arrange(.[[1]], as.numeric(.[[2]]))

        ID Days Score.x Score.y
1 patient1    0      NA       1
2 patient1   25       2      10
3 patient1  235       3      NA
4 patient1  248      NA       3
5 patient1  353       4       4
6 patient2  100       5       5
7 patient2  150      NA       7
8 patient3  503      NA       6
9 patient3  538       6      NA

此處,第 3 行和第 4 行包含相同測量的數據(得分為 3)但未連接,因為Days列的值不同(235 與 248)。

問題:我正在尋找一種在第二列(比如 30 天)上設置閾值的方法,這將導致以下 output:

> threshold <- 30
> *** insert join code ***

        ID Days Score.x Score.y
1 patient1    0      NA       1
2 patient1   25       2      10
3 patient1  248       3       3
4 patient1  353       4       4
5 patient2  100       5       5
6 patient2  150      NA       7
7 patient3  503      NA       6
8 patient3  538       6      NA

此 output 顯示先前 output 的第 3 行和第 4 行已被合並(因為 248-235 < 30)並已采用第二個 df 的Days (248)的值。

要記住的三個主要條件是:

  • 不合並同一 df(第 1 行和第 2 行)內閾值內的連續天數。
  • 在某些情況下,同一 dataframe 中最多存在四個Days變量值,因此不應合並。 可能這些值之一確實存在於另一個 dataframe 的閾值內,並且必須合並這些值。 請參閱下面示例中的第 3 行。
  • 每個分數/天數/患者組合只能使用一次。 如果合並滿足所有條件但仍有可能進行雙重合並,則應使用第一個。
> df1
        ID Days Score
1 patient1    0     1
2 patient1    5     2
3 patient1   10     3
4 patient1   15     4
5 patient1   50     5

> df2
        ID Days Score
1 patient1    0     1
2 patient1    5     2
3 patient1   12     3
4 patient1   15     4
5 patient1   50     5

> df_combined
        ID Days Score.x Score.y
1 patient1    0       1       1
2 patient1    5       2       2
3 patient1   12       3       3
4 patient1   15       4       4
5 patient1   50       5       5

為 CHINSOON12 編輯

> df1
          ID Days Score
 1: patient1    0     1
 2: patient1  116     2
 3: patient1  225     3
 4: patient1  309     4
 5: patient1  351     5
 6: patient2    0     6
 7: patient2   49     7
> df2
          ID Days Score
 1: patient1    0    11
 2: patient1   86    12
 3: patient1  195    13
 4: patient1  279    14
 5: patient1  315    15
 6: patient2    0    16
 7: patient2   91    17
 8: patient2  117    18

我將您的解決方案包裝在 function 中,如下所示:

testSO2 <- function(DT1,DT2) {
    setDT(DT1);setDT(DT2)
    names(DT1) <- c("ID","Days","X")
    names(DT2) <- c("ID","Days","Y")
    DT1$Days <- as.numeric(DT1$Days)
    DT2$Days <- as.numeric(DT2$Days)
    DT1[, c("s1", "e1", "s2", "e2") := .(Days - 30L, Days + 30L, Days, Days)]
    DT2[, c("s1", "e1", "s2", "e2") := .(Days, Days, Days - 30L, Days + 30L)]
    byk <- c("ID", "s1", "e1")
    setkeyv(DT1, byk)
    setkeyv(DT2, byk)
    o1 <- foverlaps(DT1, DT2)

    byk <- c("ID", "s2", "e2")
    setkeyv(DT1, byk)
    setkeyv(DT2, byk)
    o2 <- foverlaps(DT2, DT1)

    olaps <- funion(o1, setcolorder(o2, names(o1)))[
        is.na(Days), Days := i.Days]

    outcome <- olaps[, {
        if (all(!is.na(Days)) && any(Days == i.Days)) {
            s <- .SD[Days == i.Days, .(Days = Days[1L],
                                       X = X[1L],
                                       Y = Y[1L])]
        } else {
            s <- .SD[, .(Days = max(Days, i.Days), X, Y)]
        }
        unique(s)
    },
    keyby = .(ID, md = pmax(Days, i.Days))][, md := NULL][]
    return(outcome)
}

結果是:

> testSO2(df1,df2)
          ID Days  X  Y
 1: patient1    0  1 11
 2: patient1  116  2 12
 3: patient1  225  3 13
 4: patient1  309  4 14
 5: patient1  315  4 15
 6: patient1  351  5 NA
 7: patient2    0  6 16
 8: patient2   49  7 NA
 9: patient2   91 NA 17
10: patient2  117 NA 18

如您所見,第 4 行和第 5 行是錯誤的。 df1 中的Score值使用了兩次 (4)。 這些行周圍的正確 output 應如下所示,因為每個分數(在本例中為 X 或 Y)只能使用一次:

          ID Days  X  Y
 4: patient1  309  4 14
 5: patient1  315 NA 15
 6: patient1  351  5 NA

下面的數據框代碼。

> dput(df1)
structure(list(ID = c("patient1", "patient1", "patient1", "patient1", 
"patient1", "patient2", "patient2"), Days = c("0", "116", "225", 
"309", "351", "0", "49"), Score = 1:7), row.names = c(NA, 7L), class = "data.frame")
> dput(df2)
structure(list(ID = c("patient1", "patient1", "patient1", "patient1", 
"patient1", "patient2", "patient2", "patient2"), Days = c("0", 
"86", "195", "279", "315", "0", "91", "117"), Score = 11:18), row.names = c(NA, 
8L), class = "data.frame")

聽起來像是對現實但混亂的數據集的數據清理練習,不幸的是,我們大多數人以前都有過這種經驗。 這是另一個data.table選項:

DT1[, c("Xrn", "s1", "e1", "s2", "e2") := .(.I, Days - 30L, Days + 30L, Days, Days)]
DT2[, c("Yrn", "s1", "e1", "s2", "e2") := .(.I, Days, Days, Days - 30L, Days + 30L)]
byk <- c("ID", "s1", "e1")
setkeyv(DT1, byk)
setkeyv(DT2, byk)
o1 <- foverlaps(DT1, DT2)

byk <- c("ID", "s2", "e2")
setkeyv(DT1, byk)
setkeyv(DT2, byk)
o2 <- foverlaps(DT2, DT1)

olaps <- funion(o1, setcolorder(o2, names(o1)))[
    is.na(Days), Days := i.Days]

ans <- olaps[, {
        if (any(Days == i.Days)) {
            .SD[Days == i.Days, 
                .(Days=Days[1L], Xrn=Xrn[1L], Yrn=Yrn[1L], X=X[1L], Y=Y[1L])]
        } else {
            .SD[, .(Days=md, Xrn=Xrn[1L], Yrn=Yrn[1L], X=X[1L], Y=Y[1L])]
        }
    },
    keyby = .(ID, md = pmax(Days, i.Days))]

#or also ans[duplicated(Xrn), X := NA_integer_][duplicated(Yrn), Y := NA_integer_]
ans[rowid(Xrn) > 1L, X := NA_integer_]
ans[rowid(Yrn) > 1L, Y := NA_integer_]
ans[, c("md", "Xrn", "Yrn") := NULL][]

output 用於以下數據集:

   ID Days  X  Y
1:  1    0  1 11
2:  1   10  2 12
3:  1   25  3 13
4:  1  248  4 14
5:  1  353  5 15
6:  2  100  6 16
7:  2  150 NA 17
8:  3  503 NA 18
9:  3  538  7 NA

output 用於 OP 編輯中的第二個數據集:

          ID Days  X  Y
 1: patient1    0  1 11
 2: patient1  116  2 12
 3: patient1  225  3 13
 4: patient1  309  4 14
 5: patient1  315 NA 15
 6: patient1  351  5 NA
 7: patient2    0  6 16
 8: patient2   49  7 NA
 9: patient2   91 NA 17
10: patient2  117 NA 18

數據(我從其他鏈接的帖子中添加了更多數據,並簡化了數據以便於查看):

library(data.table)
DT1 <- data.table(ID = c(1,1,1,1,1,2,3),
    Days = c(0,10,25,235,353,100,538))[, X := .I]
DT2 <- data.table(ID = c(1,1,1,1,1,2,2,3),
    Days = c(0,10,25,248,353,100,150,503))[, Y := .I + 10L]

解釋:

  1. 依次使用每個表作為左表執行 2 次重疊連接。

  2. 將右表中設置 NA 天之前的 2 個結果與左表中的結果合並。

  3. 按患者和重疊日期分組。 如果存在相同的日期,則保留記錄。 否則使用最大日期。

  4. 每個分數只能使用一次,因此刪除重復項。

如果您發現這種方法沒有給出正確結果的情況,請告訴我。

使用lapply查找天數差異低於閾值基本解決方案並制作expand.grid以獲得所有可能的組合。 然后刪除那些會選擇兩次或在另一個后面選擇的那些。 從這些計算日差並選擇具有連續最低差的行。 然后rbind來自 df2 的不匹配。

threshold <- 30
nmScore <- threshold
x <- do.call(rbind, lapply(unique(c(df1$ID, df2$ID)), function(ID) {
  x <- df1[df1$ID == ID,]
  y <- df2[df2$ID == ID,]
  if(nrow(x) == 0) {return(data.frame(ID=ID, y[1,-1][NA,], y[,-1]))}
  if(nrow(y) == 0) {return(data.frame(ID=ID, x[,-1], x[1,-1][NA,]))}
  x <- x[order(x$Days),]
  y <- y[order(y$Days),]
  z <- do.call(expand.grid, lapply(x$Days, function(z) c(NA,
         which(abs(z - y$Days) < threshold))))
  z <- z[!apply(z, 1, function(z) {anyDuplicated(z[!is.na(z)]) > 0 ||
         any(diff(z[!is.na(z)]) < 1)}), , drop = FALSE]
  s <- as.data.frame(sapply(seq_len(ncol(z)), function(j) {
         abs(x$Days[j] - y$Days[z[,j]])}))
  s[is.na(s)] <- nmScore
  s <- matrix(apply(s, 1, sort), nrow(s), byrow = TRUE)
  i <- rep(TRUE, nrow(s))
  for(j in seq_len(ncol(s))) {i[i]  <- s[i,j] == min(s[i,j])}
  i <- unlist(z[which.max(i),])
  j <- setdiff(seq_len(nrow(y)), i)
  rbind(data.frame(ID=ID, x[,-1], y[i, -1]),
  if(length(j) > 0) data.frame(ID=ID, x[1,-1][NA,], y[j, -1], row.names=NULL))
}))
x <- x[order(x[,1], ifelse(is.na(x[,2]), x[,4], x[,2])),]

數據:

0..Boris Ruwe 的第一個測試用例,Boris Ruwe 的第 1..2 個測試用例,Boris Ruwe 的第 2..3 個測試用例,3..Uwe 的測試用例,4..Boris Ruwe 的測試用例 R 滾動連接兩個 data.tables 在 join 上有誤差,5..來自 GKi 的測試用例。

df1 <- structure(list(ID = c("0patient1", "0patient1", "0patient1", 
"0patient1", "0patient2", "0patient3", "1patient1", "1patient1", 
"1patient1", "1patient1", "1patient1", "2patient1", "2patient1", 
"2patient1", "2patient1", "2patient1", "2patient2", "2patient2", 
"3patient1", "3patient1", "3patient1", "3patient1", "3patient1", 
"3patient1", "3patient2", "3patient3", "4patient1", "4patient1", 
"4patient1", "4patient1", "4patient2", "4patient3", "5patient1", 
"5patient1", "5patient1", "5patient2"), Days = c(0, 25, 235, 
353, 100, 538, 0, 5, 10, 15, 50, 0, 116, 225, 309, 351, 0, 49, 
0, 1, 25, 235, 237, 353, 100, 538, 0, 10, 25, 340, 100, 538, 
3, 6, 10, 1), Score = c(NA, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 1, 
2, 3, 4, 5, 6, 7, NA, 2, 3, 4, 5, 6, 7, 8, NA, 2, 3, 99, 5, 6, 
1, 2, 3, 1)), row.names = c(NA, -36L), class = "data.frame")
df2 <- structure(list(ID = c("0patient1", "0patient1", "0patient1", 
"0patient1", "0patient2", "0patient2", "0patient3", "1patient1", 
"1patient1", "1patient1", "1patient1", "1patient1", "2patient1", 
"2patient1", "2patient1", "2patient1", "2patient1", "2patient2", 
"2patient2", "2patient2", "3patient1", "3patient1", "3patient1", 
"3patient1", "3patient1", "3patient1", "3patient2", "3patient2", 
"3patient3", "4patient1", "4patient1", "4patient1", "4patient1", 
"4patient2", "4patient2", "4patient3", "5patient1", "5patient1", 
"5patient1", "5patient3"), Days = c(0, 25, 248, 353, 100, 150, 
503, 0, 5, 12, 15, 50, 0, 86, 195, 279, 315, 0, 91, 117, 0, 25, 
233, 234, 248, 353, 100, 150, 503, 0, 10, 25, 353, 100, 150, 
503, 1, 4, 8, 1), Score = c(1, 10, 3, 4, 5, 7, 6, 1, 2, 3, 4, 
5, 11, 12, 13, 14, 15, 16, 17, 18, 11, 12, 13, 14, 15, 16, 17, 
18, 19, 1, 10, 3, 4, 5, 7, 6, 11, 12, 13, 1)), row.names = c(NA, 
-40L), class = "data.frame")
df1
#          ID Days Score
#1  0patient1    0    NA
#2  0patient1   25     2
#3  0patient1  235     3
#4  0patient1  353     4
#5  0patient2  100     5
#6  0patient3  538     6
#7  1patient1    0     1
#8  1patient1    5     2
#9  1patient1   10     3
#10 1patient1   15     4
#11 1patient1   50     5
#12 2patient1    0     1
#13 2patient1  116     2
#14 2patient1  225     3
#15 2patient1  309     4
#16 2patient1  351     5
#17 2patient2    0     6
#18 2patient2   49     7
#19 3patient1    0    NA
#20 3patient1    1     2
#21 3patient1   25     3
#22 3patient1  235     4
#23 3patient1  237     5
#24 3patient1  353     6
#25 3patient2  100     7
#26 3patient3  538     8
#27 4patient1    0    NA
#28 4patient1   10     2
#29 4patient1   25     3
#30 4patient1  340    99
#31 4patient2  100     5
#32 4patient3  538     6
#33 5patient1    3     1
#34 5patient1    6     2
#35 5patient1   10     3
#36 5patient2    1     1

df2
#          ID Days Score
#1  0patient1    0     1
#2  0patient1   25    10
#3  0patient1  248     3
#4  0patient1  353     4
#5  0patient2  100     5
#6  0patient2  150     7
#7  0patient3  503     6
#8  1patient1    0     1
#9  1patient1    5     2
#10 1patient1   12     3
#11 1patient1   15     4
#12 1patient1   50     5
#13 2patient1    0    11
#14 2patient1   86    12
#15 2patient1  195    13
#16 2patient1  279    14
#17 2patient1  315    15
#18 2patient2    0    16
#19 2patient2   91    17
#20 2patient2  117    18
#21 3patient1    0    11
#22 3patient1   25    12
#23 3patient1  233    13
#24 3patient1  234    14
#25 3patient1  248    15
#26 3patient1  353    16
#27 3patient2  100    17
#28 3patient2  150    18
#29 3patient3  503    19
#30 4patient1    0     1
#31 4patient1   10    10
#32 4patient1   25     3
#33 4patient1  353     4
#34 4patient2  100     5
#35 4patient2  150     7
#36 4patient3  503     6
#37 5patient1    1    11
#38 5patient1    4    12
#39 5patient1    8    13
#40 5patient3    1     1

結果:

#           ID Days Score Days.1 Score.1
#1   0patient1    0    NA      0       1
#2   0patient1   25     2     25      10
#3   0patient1  235     3    248       3
#4   0patient1  353     4    353       4
#5   0patient2  100     5    100       5
#110 0patient2   NA    NA    150       7
#111 0patient3   NA    NA    503       6
#6   0patient3  538     6     NA      NA
#7   1patient1    0     1      0       1
#8   1patient1    5     2      5       2
#9   1patient1   10     3     12       3
#10  1patient1   15     4     15       4
#11  1patient1   50     5     50       5
#12  2patient1    0     1      0      11
#112 2patient1   NA    NA     86      12
#13  2patient1  116     2     NA      NA
#210 2patient1   NA    NA    195      13
#14  2patient1  225     3     NA      NA
#37  2patient1   NA    NA    279      14
#15  2patient1  309     4    315      15
#16  2patient1  351     5     NA      NA
#17  2patient2    0     6      0      16
#18  2patient2   49     7     NA      NA
#113 2patient2   NA    NA     91      17
#211 2patient2   NA    NA    117      18
#19  3patient1    0    NA      0      11
#20  3patient1    1     2     NA      NA
#21  3patient1   25     3     25      12
#114 3patient1   NA    NA    233      13
#22  3patient1  235     4    234      14
#23  3patient1  237     5    248      15
#24  3patient1  353     6    353      16
#25  3patient2  100     7    100      17
#115 3patient2   NA    NA    150      18
#116 3patient3   NA    NA    503      19
#26  3patient3  538     8     NA      NA
#27  4patient1    0    NA      0       1
#28  4patient1   10     2     10      10
#29  4patient1   25     3     25       3
#30  4patient1  340    99    353       4
#31  4patient2  100     5    100       5
#117 4patient2   NA    NA    150       7
#118 4patient3   NA    NA    503       6
#32  4patient3  538     6     NA      NA
#119 5patient1   NA    NA      1      11
#33  5patient1    3     1      4      12
#34  5patient1    6     2      8      13
#35  5patient1   10     3     NA      NA
#36  5patient2    1     1     NA      NA
#NA  5patient3   NA    NA      1       1

格式化結果:

data.frame(ID=x[,1], Days=ifelse(is.na(x[,2]), x[,4], x[,2]),
 Score.x=x[,3], Score.y=x[,5])
#          ID Days Score.x Score.y
#1  0patient1    0      NA       1
#2  0patient1   25       2      10
#3  0patient1  235       3       3
#4  0patient1  353       4       4
#5  0patient2  100       5       5
#6  0patient2  150      NA       7
#7  0patient3  503      NA       6
#8  0patient3  538       6      NA
#9  1patient1    0       1       1
#10 1patient1    5       2       2
#11 1patient1   10       3       3
#12 1patient1   15       4       4
#13 1patient1   50       5       5
#14 2patient1    0       1      11
#15 2patient1   86      NA      12
#16 2patient1  116       2      NA
#17 2patient1  195      NA      13
#18 2patient1  225       3      NA
#19 2patient1  279      NA      14
#20 2patient1  309       4      15
#21 2patient1  351       5      NA
#22 2patient2    0       6      16
#23 2patient2   49       7      NA
#24 2patient2   91      NA      17
#25 2patient2  117      NA      18
#26 3patient1    0      NA      11
#27 3patient1    1       2      NA
#28 3patient1   25       3      12
#29 3patient1  233      NA      13
#30 3patient1  235       4      14
#31 3patient1  237       5      15
#32 3patient1  353       6      16
#33 3patient2  100       7      17
#34 3patient2  150      NA      18
#35 3patient3  503      NA      19
#36 3patient3  538       8      NA
#37 4patient1    0      NA       1
#38 4patient1   10       2      10
#39 4patient1   25       3       3
#40 4patient1  340      99       4
#41 4patient2  100       5       5
#42 4patient2  150      NA       7
#43 4patient3  503      NA       6
#44 4patient3  538       6      NA
#45 5patient1    1      NA      11
#46 5patient1    3       1      12
#47 5patient1    6       2      13
#48 5patient1   10       3      NA
#49 5patient2    1       1      NA
#50 5patient3    1      NA       1

獲得Days的替代方案:

#From df1 and in case it is NA I took it from df2
data.frame(ID=x[,1], Days=ifelse(is.na(x[,2]), x[,4], x[,2]),
 Score.x=x[,3], Score.y=x[,5])

#From df2 and in case it is NA I took it from df1
data.frame(ID=x[,1], Days=ifelse(is.na(x[,4]), x[,2], x[,4]),
 Score.x=x[,3], Score.y=x[,5])

#Mean
data.frame(ID=x[,1], Days=rowMeans(x[,c(2,4)], na.rm=TRUE),
 Score.x=x[,3], Score.y=x[,5])

如果應盡量減少天數的差異,允許不采用最近的,一種可能的方法是:

threshold <- 30
nmScore <- threshold
x <- do.call(rbind, lapply(unique(c(df1$ID, df2$ID)), function(ID) {
  x <- df1[df1$ID == ID,]
  y <- df2[df2$ID == ID,]
  x <- x[order(x$Days),]
  y <- y[order(y$Days),]
  if(nrow(x) == 0) {return(data.frame(ID=ID, y[1,-1][NA,], y[,-1]))}
  if(nrow(y) == 0) {return(data.frame(ID=ID, x[,-1], x[1,-1][NA,]))}
  z <- do.call(expand.grid, lapply(x$Days, function(z) c(NA,
         which(abs(z - y$Days) < threshold))))
  z <- z[!apply(z, 1, function(z) {anyDuplicated(z[!is.na(z)]) > 0 ||
         any(diff(z[!is.na(z)]) < 1)}), , drop = FALSE]
  s <- as.data.frame(sapply(seq_len(ncol(z)), function(j) {
         abs(x$Days[j] - y$Days[z[,j]])}))
  s[is.na(s)] <- nmScore
  i <- unlist(z[which.min(rowSums(s)),])
  j <- setdiff(seq_len(nrow(y)), i)
  rbind(data.frame(ID=ID, x[,-1], y[i, -1]),
  if(length(j) > 0) data.frame(ID=ID, x[1,-1][NA,], y[j, -1], row.names=NULL))
}))
x <- x[order(x[,1], ifelse(is.na(x[,2]), x[,4], x[,2])),]

遲到了,這里有一個解決方案,它使用完全外連接然后根據 OP 的規則對行進行分組和聚合

library(data.table)
threshold <- 30
# full outer join
m <- merge(setDT(df1)[, o := 1L], setDT(df2)[, o := 2L], 
           by = c("ID", "Days"), all = TRUE)
# reorder rows
setorder(m, ID, Days)
# create grouping variable
m[, g := rleid(ID,
               cumsum(c(TRUE, diff(Days) > threshold)),
               !is.na(o.x) & !is.na(o.y),
               cumsum(c(TRUE, diff(fcoalesce(o.x, o.y)) == 0L))
)][, g := rleid(g, (rowid(g) - 1L) %/% 2)][]
# collapse rows where required
m[, .(ID = last(ID), Days = last(Days), 
      Score.x = last(na.omit(Score.x)), 
      Score.y = last(na.omit(Score.y)))
  , by = g][, g := NULL][]

對於 OP 的第一個測試用例,我們得到

 ID Days Score.x Score.y 1: patient1 0 NA 1 2: patient1 25 2 10 3: patient1 248 3 3 4: patient1 353 4 4 5: patient2 100 5 5 6: patient2 150 NA 7 7: patient3 503 NA 6 8: patient3 538 6 NA

正如預期的那樣。

使用其他用例進行驗證

使用 OP 的第二個測試用例

df1 <- data.table(ID = rep("patient1", 5L), Days = c(0, 5, 10, 15, 50), Score = 1:5)
df2 <- data.table(ID = rep("patient1", 5L), Days = c(0, 5, 12, 15, 50), Score = 1:5)

我們得到

 ID Days Score.x Score.y 1: patient1 0 1 1 2: patient1 5 2 2 3: patient1 12 3 3 4: patient1 15 4 4 5: patient1 50 5 5

使用 OP 的第三個測試用例(用於討論 chinsoon12 的答案

df1 <- data.table(ID = paste0("patient", c(rep(1, 5L), 2, 2)), 
                  Days = c(0, 116, 225, 309, 351, 0, 49), Score = 1:7)
df2 <- data.table(ID = paste0("patient", c(rep(1, 5L), 2, 2, 2)), 
                  Days = c(0, 86, 195, 279, 315, 0, 91, 117), Score = 11:18)

我們得到

 ID Days Score.x Score.y 1: patient1 0 1 11 2: patient1 116 2 12 3: patient1 225 3 13 4: patient1 309 4 14 5: patient1 315 NA 15 6: patient1 351 5 NA 7: patient2 0 6 16 8: patient2 49 7 NA 9: patient2 91 NA 17 10: patient2 117 NA 18

正如 OP 所期望的那樣(特別是第 5 行)

最后,我自己的測試用例在233和248之間有5個“重疊天”來驗證這個用例是否會被處理

df1 <- data.table(ID = paste0("patient", c(rep(1, 6L), 2, 3)),
                  Days = c(0,1,25,235,237,353,100,538),
                  Score = c(NA, 2:8))
df2 <- data.table(ID = paste0("patient", c(rep(1, 6L), 2, 2, 3)),
                  Days = c(0, 25, 233, 234, 248, 353, 100, 150, 503),
                  Score = 11:19)

我們得到

 ID Days Score.x Score.y 1: patient1 0 NA 11 # exact match 2: patient1 1 2 NA # overlapping, not collapsed 3: patient1 25 3 12 # exact match 4: patient1 233 NA 13 # overlapping, not collapsed 5: patient1 235 4 14 # overlapping, collapsed 6: patient1 248 5 15 # overlapping, collapsed 7: patient1 353 6 16 # exact match 8: patient2 100 7 17 # exact match 9: patient2 150 NA 18 # not overlapping 10: patient3 503 NA 19 # not overlapping 11: patient3 538 8 NA # not overlapping

解釋

完全外連接merge(..., all = TRUE)在相同的 ID 和日期上找到完全匹配,但包括兩個數據集中沒有匹配的所有其他行。

在加入之前,每個數據集都會獲得一個額外的列o來指示每個Score來源

結果是有序的,因為后續操作取決於正確的行順序。

所以,用我自己的測試用例,我們得到

m <- merge(setDT(df1)[, o := 1L], setDT(df2)[, o := 2L], 
           by = c("ID", "Days"), all = TRUE)
setorder(m, ID, Days)[]
 ID Days Score.x ox Score.y oy 1: patient1 0 NA 1 11 2 2: patient1 1 2 1 NA NA 3: patient1 25 3 1 12 2 4: patient1 233 NA NA 13 2 5: patient1 234 NA NA 14 2 6: patient1 235 4 1 NA NA 7: patient1 237 5 1 NA NA 8: patient1 248 NA NA 15 2 9: patient1 353 6 1 16 2 10: patient2 100 7 1 17 2 11: patient2 150 NA NA 18 2 12: patient3 503 NA NA 19 2 13: patient3 538 8 1 NA NA

現在,使用rleid()創建一個分組變量:

m[, g := rleid(ID,
               cumsum(c(TRUE, diff(Days) > threshold)),
               !is.na(o.x) & !is.na(o.y),
               cumsum(c(TRUE, diff(fcoalesce(o.x, o.y)) == 0L))
)][, g := rleid(g, (rowid(g) - 1L) %/% 2)][]

當滿足以下條件之一時,組計數器會提前:

  • ID改變
  • 在一個ID內,當連續Days之間的間隔超過 30 天時(因此一個 ID 內間隔為 30 天或更短的行屬於一個組或“重疊”)
  • 當一行是直接匹配時,
  • 當連續的行具有相同的原點時,從而識別出交替原點的行的條紋,例如 1、2、1、2 1, 2, 1, 2, ...2, 1, 2, 1, ...
  • 最后,在上述條紋中,計算交替起源的行對,例如,一行來自df1 ,后一行來自df2 ,或者一行來自df2 ,后一行來自df1

OP沒有明確說明最后一個條件,但這是我對

每個分數/天數/患者組合只能使用一次。 如果合並滿足所有條件但仍有可能進行雙重合並,則應使用第一個。

它確保最多折疊兩行,每行來自不同的數據集

分組后我們得到

 ID Days Score.x ox Score.y oy g 1: patient1 0 NA 1 11 2 1 2: patient1 1 2 1 NA NA 2 3: patient1 25 3 1 12 2 3 4: patient1 233 NA NA 13 2 4 5: patient1 234 NA NA 14 2 5 6: patient1 235 4 1 NA NA 5 7: patient1 237 5 1 NA NA 6 8: patient1 248 NA NA 15 2 6 9: patient1 353 6 1 16 2 7 10: patient2 100 7 1 17 2 8 11: patient2 150 NA NA 18 2 9 12: patient3 503 NA NA 19 2 10 13: patient3 538 8 1 NA NA 11

大多數組僅包含一行,少數包含 2 行,這些行在最后一步中折疊(按組聚合,返回所需的列並刪除分組變量g )。

改進的代碼

按組聚合要求對於每個組,每列僅返回一個值(長度為 1 的向量)。 (否則,組結果將由多行組成。)為簡單起見,上面的實現在所有 4 列上都使用了last()

last(Days)等價於max(Days)因為數據集是有序的。

但是,如果我理解正確,OP 更喜歡從df2返回Days值(盡管 OP 已經提到max(Days)也是可以接受的)。

為了從df2返回Days值,需要修改聚合步驟:如果組大小.N大於 1,我們從源自df2的行中選擇Days值,即oy == 2

# collapse rows where required
m[, .(ID = last(ID), 
      Days = last(if (.N > 1) Days[which(o.y == 2)] else Days), 
      Score.x = last(na.omit(Score.x)), 
      Score.y = last(na.omit(Score.y)))
  , by = g][, g := NULL][]

這將返回

 ID Days Score.x Score.y 1: patient1 0 NA 11 2: patient1 1 2 NA 3: patient1 25 3 12 4: patient1 233 NA 13 5: patient1 234 4 14 6: patient1 248 5 15 7: patient1 353 6 16 8: patient2 100 7 17 9: patient2 150 NA 18 10: patient3 503 NA 19 11: patient3 538 8 NA

現在已從df2中選擇折疊的第 5 行中的Days值 234 。

對於Score列, last()的使用根本不重要,因為在一組 2 行中應該只有一個非 NA 值。 因此, na.omit()應該只返回一個值,而last()可能只是為了保持一致性。

此代碼允許您給出一個閾值,然后將 df1 中的分數合並到 df1 作為新列。 它只會添加落在 df2 +/- 閾值的單個分數范圍內的分數。 請注意,不可能將所有分數都連接起來,因為沒有閾值可以讓所有分數唯一匹配。

threshold <- 40
WhereDF1inDF2 <- apply(sapply(lapply(df2$Days, function(x) (x+threshold):(x-threshold)), function(y) df1$Days %in% y),1,which)
useable <- sapply(WhereDF1inDF2, function(x) length(x) ==1 )
df2$Score1 <- NA
df2$Score1[unlist(WhereDF1inDF2[useable])] <- df1$Score[useable]

> df2
        ID Days Score Score1
1 patient1    0     1     NA
2 patient1   25    10     NA
3 patient1  248     3      3
4 patient1  353     4      4
5 patient2  100     5      5
6 patient2  150     7     NA
7 patient3  503     6      6

這是一個可能的data.table解決方案

library(data.table)
#convert df1 and df2 to data.table format
setDT(df1);setDT(df2)
#set colnames for later on 
#  (add .df1/.df2 suffix after Days and Score-colnamaes)
cols <- c("Days", "Score")
setnames(df1, cols, paste0( cols, ".df1" ) )
setnames(df2, cols, paste0( cols, ".df2" ) )
#update df1 with new measures from df2 (and df2 with df1)
# copies are made, to prevent changes in df1 and df2
dt1 <- copy(df1)[ df2, `:=`(Days.df2 = i.Days.df2, Score.df2 = i.Score.df2), on = .(ID, Days.df1 = Days.df2), roll = 30]
dt2 <- copy(df2)[ df1, `:=`(Days.df1 = i.Days.df1, Score.df1 = i.Score.df1), on = .(ID, Days.df2 = Days.df1), roll = -30]
#rowbind by columnnames (here the .df1/.df2 suffix is needed!), only keep unique rows
ans <- unique( rbindlist( list( dt1, dt2), use.names = TRUE ) )
#wrangle data to get to desired output
ans[, Days := ifelse( is.na(Days.df2), Days.df1, Days.df2 ) ]
ans <- ans[, .(Days, Score.x = Score.df1, Score.y = Score.df2 ), by = .(ID) ]
setkey( ans, ID, Days )  #for sorting; setorder() can also be used.
#          ID Days Score.x Score.y
# 1: patient1    0      NA       1
# 2: patient1   25       2      10
# 3: patient1  248       3       3
# 4: patient1  353       4       4
# 5: patient2  100       5       5
# 6: patient2  150      NA       7
# 7: patient3  503      NA       6
# 8: patient3  538       6      NA

以下代碼適用於您的示例數據。 根據您的條件,它應該適用於您的完整數據。 對於其他例外情況,您可以調整df31df32

df1 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient3"),
                  "Days1" = c(0,25,235,353,100,538),
                  "Score1" = c(NA,2,3,4,5,6), 
                  stringsAsFactors = FALSE)
df2 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient2","patient3"),
                  "Days2" = c(0,25,248,353,100,150,503),
                  "Score2" = c(1,10,3,4,5,7,6), 
                  stringsAsFactors = FALSE)

##  define a dummy sequence for each patient
df11 <- df1 %>% group_by(ID) %>% mutate(ptseq = row_number())
df21 <- df2 %>% group_by(ID) %>% mutate(ptseq = row_number())

df3 <- dplyr::full_join(df11, df21, by=c("ID","ptseq")) %>% 
         arrange(.[[1]], as.numeric(.[[2]]))

df31 <- df3 %>% mutate(Days=Days2, diff=Days1-Days2) %>% 
    mutate(Score1=ifelse(abs(diff)>30, NA, Score1))
df32 <- df3 %>% mutate(diff=Days1-Days2) %>%
     mutate(Days = case_when(abs(diff)>30 ~ Days1), Score2=c(NA), Days2=c(NA)) %>% 
     subset(!is.na(Days))

df <- rbind(df31,df32) %>%  select(ID, ptseq, Days, Score1, Score2) %>% 
         arrange(.[[1]], as.numeric(.[[2]])) %>% select(-2)

>df

ID        Days Score1 Score2
  <chr>    <dbl>  <dbl>  <dbl>
1 patient1     0     NA      1
2 patient1    25      2     10
3 patient1   248      3      3
4 patient1   353      4      4
5 patient2   100      5      5
6 patient2   150     NA      7
7 patient3   503     NA      6
8 patient3   538      6     NA

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM