I am attempting to develop a time varying Cox proportional hazards (CPH) model in R and was wondering if anyone has generated any code to help format data for the counting structure that is used in time varying / time dependent CPH models.
To make the problem reproducible and somewhat simpler, I have extracted the first 100 rows of data, which features 4 variables ( id
, date
, y
, and x
). The id
is a unique subject identifier. The date
is an integer sequence from 0 to n days of observation for each id
. y
is the status or outcome of the hazard analysis and x
is the time varying covariate. In this example, once y
= 1 has occurred the data for each subject will be censored and no additional data should be included in the ideal output dataframe.
The data are structured so that each subject has 1 row that corresponds to each day of observation.
head(test)
id date y x
1 0 0 0
1 1 0 1
1 2 0 1
1 3 0 1
1 4 0 1
1 5 0 0
However, as I understand it, the cph
function in R requires that time varying covariates be structured in such a way that the start
and end
variables need to be recoded into 3 rows with intervals from (0,1] and (1,5] and (5,6] for the data featured in the head(test)
code block above.
The first 100 rows of data can be reconstructed using this code:
dput(test)
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,
5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9,
9, 9, 9), date = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2,
3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7,
8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
0, 1, 2, 3, 4, 5, 6, 7, 8), y = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0), x = c(0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L,
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("id",
"date", "y", "x"), row.names = c(NA, -100L), class = "data.frame")
Ideally, I am trying to recode these data so that the output would be:
head(ideal_output)
id start end y x
1 0 1 0 0
1 1 5 0 1
1 5 6 0 0
1 6 7 0 1
1 7 9 0 0
1 9 11 0 1
1 11 20 0 0
2 0 8 0 0
3 0 1 0 0
3 1 3 0 1
3 3 4 0 0
3 4 6 0 1
3 6 7 1 1
4 0 2 0 0
4 2 4 0 1
4 4 7 0 0
5 0 9 0 0
6 0 7 0 0
7 0 1 0 0
7 1 2 0 1
7 2 3 0 0
7 3 4 1 0
8 0 3 0 0
8 3 4 1 1
9 0 2 0 0
9 2 5 0 1
9 5 6 1 1
I have done this manually to create the ideal_output
above but it is an error prone process and untenable for the hundreds of id
's and several covariates that I need to evaluate. Consequently, any help would be greatly appreciated in developing an automated way to approach this data formatting challenge. Thanks!
I think the Survsplit() function is the answer to your problem.
look at: http://www.rdocumentation.org/packages/eha/functions/SurvSplit
Alternatively, try to google: Chapter 5 Extended and Stratified Cox - nus.edu.sg
As @Ham suggest you can use tmerge
. Here is an example
> #####
> # `dat` is the data.frame you provided
> library(survival)
>
> # make baseline data.frame for tmerge
> baseline <- by(dat, dat$id, function(x){
+ n <- nrow(x)
+ # avoid slow data.frame call
+ structure(list(
+ id = x$id[1], start = x$date[1], x = x$x[1], end = x$date[n],
+ dummy = 0),
+ row.names = 1L, class = "data.frame")
+ })
> baseline <- do.call(rbind, baseline)
> baseline # show baseline data
id start x end dummy
1 1 0 0 19 0
2 2 0 0 7 0
3 3 0 0 12 0
4 4 0 0 6 0
5 5 0 0 8 0
6 6 0 0 6 0
7 7 0 0 11 0
8 8 0 0 14 0
9 9 0 0 8 0
>
> # use tmerge
> final_dat <- tmerge(baseline, baseline, id = id, y = event(end, dummy))
> final_dat <- tmerge(
+ final_dat, dat, id = id, y = cumtdc(date, y), x = tdc(date, x))
> final_dat[final_dat$id == 3, ] # look at one example
id start x end dummy tstart tstop y
27 3 0 0 12 0 0 1 0
28 3 0 1 12 0 1 2 0
29 3 0 1 12 0 2 3 0
30 3 0 0 12 0 3 4 0
31 3 0 1 12 0 4 5 0
32 3 0 1 12 0 5 6 0
33 3 0 1 12 0 6 7 1
34 3 0 1 12 0 7 8 1
35 3 0 1 12 0 8 9 1
36 3 0 1 12 0 9 10 1
37 3 0 1 12 0 10 11 1
38 3 0 0 12 0 11 12 1
>
> # remove values where y is not zero or y is not the first non-zero value
> final_dat <- within(final_dat, ycum <- unlist(tapply(y, id, cumsum)))
> final_dat <- final_dat[final_dat$ycum < 2, ]
> final_dat$ycum <- NULL
> final_dat[final_dat$id == 3, ]
id start x end dummy tstart tstop y
27 3 0 0 12 0 0 1 0
28 3 0 1 12 0 1 2 0
29 3 0 1 12 0 2 3 0
30 3 0 0 12 0 3 4 0
31 3 0 1 12 0 4 5 0
32 3 0 1 12 0 5 6 0
33 3 0 1 12 0 6 7 1
>
> # remove x row where the previous x value do match. But
> # * keep those where y = 1
> # * update tstop for the last row where the last row may be removed
> final_dat <- within(
+ final_dat,
+ max_t <- unlist(tapply(tstop, id, function(z) rep(max(z), length(z)))))
> final_dat <- within(
+ final_dat,
+ keep <- unlist(tapply(x, id, function(z)
+ c(TRUE, z[-1] != z[-length(z)]))))
>
> final_dat <- final_dat[final_dat$keep | final_dat$y, ]
>
> final_dat <- within(
+ final_dat, is_last <- unlist(tapply(id, id, function(z)
+ seq_along(z) == length(z))))
>
> needs_update <- final_dat$is_last & !final_dat$y
> final_dat[needs_update, "tstop"] <-
+ final_dat[needs_update, "max_t"] + 1
>
> # have to update the tstop column
> final_dat <- within(final_dat, tstop <- unlist(by(
+ cbind(tstart, tstop), id, function(z) {
+ n <- nrow(z)
+ c(z$tstart[-1], z$tstop[n])
+ })))
>
> # show final data.frame
> final_dat[, c("id", "tstart", "tstop", "y", "x")]
id tstart tstop y x
1 1 0 1 0 0
2 1 1 5 0 1
6 1 5 6 0 0
7 1 6 7 0 1
8 1 7 9 0 0
10 1 9 11 0 1
12 1 11 20 0 0
20 2 0 8 0 0
27 3 0 1 0 0
28 3 1 3 0 1
30 3 3 4 0 0
31 3 4 6 0 1
33 3 6 7 1 1
39 4 0 2 0 0
41 4 2 4 0 1
43 4 4 7 0 0
45 5 0 9 0 0
53 6 0 7 0 0
59 7 0 1 0 0
60 7 1 2 0 1
61 7 2 3 0 0
62 7 3 4 1 0
70 8 0 3 0 0
73 8 3 4 1 1
84 9 0 2 0 0
86 9 2 5 0 1
89 9 5 6 1 1
The code after tmerge
can be done faster with dplyr
or data.table
. If you have more columns than just one, x
, then I suggest that you: 1) store a column index of dat
and use that in tmerge
in the tdc
function instead of x
. Then merge the tables afterwards with merge
. Further, you need to update the line that makes the keep
indicator. Otherwise the code should be identical.
I think the tmerge() function is the answer to your problem.
look at: https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.