
Complex long to wide data transformation (with time-varying variable)

I am currently working on a Multistate Analysis dataset in "long" form (one row for each individual's observation; each individual is repeatedly measured up to 5 times).

The idea is that each individual can recurrently transition across the levels of the time-varying state variable s = 1, 2, 3, 4. All the other variables that I have (here cohort) are fixed within any given id.

After some analyses, I need to reshape the dataset in "wide" form, according to the specific sequence of visited states. Here is an example of the initial long data:

    dat <- read.table(text = "
        id    cohort    s
        1       1       2
        1       1       2
        1       1       1
        1       1       4
        2       3       1
        2       3       1
        2       3       3
        3       2       1
        3       2       2
        3       2       3
        3       2       3
        3       2       4",
        header = TRUE)

The final "wide" dataset should take into account the specific individual sequence of visited states, recorded in the newly created variables s1, s2, s3, s4, s5, where s1 is the first state visited by the individual, and so on.

According to the above example, the wide dataset looks like:

    id    cohort    s1    s2    s3    s4    s5    
    1       1       2      2     1     4     0
    2       3       1      1     3     0     0
    3       2       1      2     3     3     4

I tried to use reshape(), and also tried transposing s, but did not get the intended result. My knowledge of R functions is quite limited. Can you give any suggestions? Thanks.

EDIT: obtaining a different kind of wide dataset

Thank you all for your help. I have a related question, if I may. When each individual is observed for a long time and there are few transitions across states, it is very useful to reshape the initial sample dat in this alternative way:

    id    cohort    s1    s2    s3    s4    s5    dur1  dur2  dur3  dur4  dur5 
    1       1       2      1     4     0     0      2     1     1     0     0  
    2       3       1      3     0     0     0      2     1     0     0     0
    3       2       1      2     3     4     0      1     1     2     1     0

In practice, s1 - s5 are now the distinct visited states, and dur1 - dur5 are the time spent in each respective distinct visited state.

Can you please help me reach this data structure? I believe it is necessary to create all the dur and s variables in an intermediate sample before using reshape(). Or is it possible to do this directly with reshape2?

dat <- read.table(text = "
        id    cohort    s    
        1       1       2
        1       1       2
        1       1       1
        1       1       4
        2       3       1
        2       3       1
        2       3       3
        3       2       1
        3       2       2
        3       2       3
        3       2       3
        3       2       4", 
    header=TRUE)     

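# Number the observations within each id (rows of an id must be contiguous)
df <- data.frame(
    dat,
    period = sequence(rle(dat$id)$lengths)
)

# Spread s into one column per period
wide <- reshape(df, v.names = "s", idvar = c("id", "cohort"),
                timevar = "period", direction = "wide")

# Periods that were not observed come out as NA; recode them as 0
wide[is.na(wide)] <- 0
wide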

Gives:

  id cohort s.1 s.2 s.3 s.4 s.5
1  1      1   2   2   1   4   0
5  2      3   1   1   3   0   0
8  3      2   1   2   3   3   4

Then the following line gives the names you asked for:

names(wide) <- c('id', 'cohort', paste0('s', 1:5))

#   id cohort s1 s2 s3 s4 s5
# 1  1      1  2  2  1  4  0
# 5  2      3  1  1  3  0  0
# 8  3      2  1  2  3  3  4

If you use sep='' in the reshape() call, you do not have to rename the variables:

wide <- reshape(df, v.names = "s", idvar = c("id", "cohort"),
                timevar = "period", direction = "wide", sep='')

I suspect there are ways to avoid creating the period variable and avoid replacing NA directly in the wide statement, but I have not figured those out yet.
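
One possible way to skip both the helper period column and the NA replacement is to pad each individual's state vector with zeros in base R before binding the rows together. This is only a minimal sketch and not part of the answer above: the helper pad_zero and the object names states, s_mat, ids, and wide_base are made up for illustration, and it assumes that cohort is constant within id, as stated in the question.

```r
dat <- read.table(text = "
    id cohort s
    1  1      2
    1  1      2
    1  1      1
    1  1      4
    2  3      1
    2  3      1
    2  3      3
    3  2      1
    3  2      2
    3  2      3
    3  2      3
    3  2      4", header = TRUE)

# Hypothetical helper: right-pad a vector with zeros up to length n
pad_zero <- function(v, n) c(v, rep(0, n - length(v)))

# One vector of visited states per individual, kept in row order
states <- split(dat$s, dat$id)
n_max  <- max(lengths(states))

# Stack the padded vectors into a matrix with one row per id
s_mat <- do.call(rbind, lapply(states, pad_zero, n = n_max))
colnames(s_mat) <- paste0("s", seq_len(n_max))

# One row per individual, sorted by id to match the order used by split()
ids <- unique(dat[c("id", "cohort")])
ids <- ids[order(ids$id), ]

wide_base <- data.frame(ids, s_mat, row.names = NULL)
wide_base
```

Because the padding happens before the rows are combined, no NA values are ever created, and the column names are set directly rather than derived from a time variable.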

ok...

library(plyr)
library(reshape2)

dat2 <- ddply(dat,.(id,cohort), function(x) 
       data.frame(s=x$s,name=paste0("s",seq_along(x$s))))


dat2 <- ddply(dat2,.(id,cohort), function(x) 
       dcast(x, id + cohort ~ name, value.var= "s" ,fill= 0)
       )

dat2[is.na(dat2)] <- 0

dat2

#    id cohort s1 s2 s3 s4 s5
#    1  1      1  2  2  1  4  0
#    2  2      3  1  1  3  0  0
#    3  3      2  1  2  3  3  4

Does this seem right? I admit the first ddply is hardly elegant.

Try this:

library(reshape2)

dat$seq <- ave(dat$id, dat$id, FUN = function(x) paste0("s", seq_along(x)))
dat.s <- dcast(dat, id + cohort ~ seq, value.var = "s", fill = 0)

which gives this:

> dat.s
  id cohort s1 s2 s3 s4 s5
1  1      1  2  2  1  4  0
2  2      3  1  1  3  0  0
3  3      2  1  2  3  3  4

If you do not mind using just 1, 2, ..., 5 as column names, then you could shorten the ave line to just:

dat$seq <- ave(dat$id, dat$id, FUN = seq_along)
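
To see what the counter produced by ave() looks like, here is a small sketch on the id column alone. It also compares the result with the sequence(rle(...)$lengths) construction used in the reshape() answer. The two agree only when all rows of an individual are contiguous, which holds for the example data; ave() does not need that assumption. The object names counter_ave and counter_rle are made up for illustration.

```r
# The id column of the example data
id <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3)

# Within-id observation counter via ave(): restarts at 1 for each id
counter_ave <- ave(id, id, FUN = seq_along)

# Same counter via run lengths: relies on each id forming one contiguous run
counter_rle <- sequence(rle(id)$lengths)

counter_ave
# 1 2 3 4 1 2 3 1 2 3 4 5

all(counter_ave == counter_rle)
# TRUE
```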

Regarding the second question, which was added later, try this:

library(plyr)

# For one individual's rows, return the length of each run of identical states
dur.fn <- function(x) {
  r <- rle(x$s)$lengths
  data.frame(id = x$id[1], dur.value = r, dur.seq = paste0("dur", seq_along(r)))
}
dat.dur.long <- ddply(dat, .(id), dur.fn)

# One column per spell; ids with fewer spells are filled with 0
dat.dur <- dcast(dat.dur.long, id ~ dur.seq, c, value.var = "dur.value", fill = 0)
cbind(dat.s, dat.dur[-1])

which gives:

  id cohort s1 s2 s3 s4 s5 dur1 dur2 dur3 dur4
1  1      1  2  2  1  4  0    2    1    1    0
2  2      3  1  1  3  0  0    2    1    0    0
3  3      2  1  2  3  3  4    1    1    2    1
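
Note that in the output above s1 - s5 still list every observation, whereas the edited question asks for s1 - s5 to hold only the distinct visited states alongside dur1 - dur5. Since rle() returns both the run values and the run lengths, one way to get that layout is sketched below in base R. This is an illustration under assumptions rather than part of the answer above: the names n_slots, pad_zero, runs, and spell_wide are hypothetical, and n_slots = 5 assumes at most 5 observations per individual, as stated in the question.

```r
dat <- read.table(text = "
    id cohort s
    1  1      2
    1  1      2
    1  1      1
    1  1      4
    2  3      1
    2  3      1
    2  3      3
    3  2      1
    3  2      2
    3  2      3
    3  2      3
    3  2      4", header = TRUE)

n_slots <- 5  # assumption: no individual has more than 5 observations

# Hypothetical helper: right-pad a vector with zeros up to length n
pad_zero <- function(v, n) c(v, rep(0, n - length(v)))

# Run-length encode the state sequence of each individual
runs <- lapply(split(dat$s, dat$id), rle)

# values = distinct consecutive states, lengths = time spent in each of them
s_mat   <- do.call(rbind, lapply(runs, function(r) pad_zero(r$values,  n_slots)))
dur_mat <- do.call(rbind, lapply(runs, function(r) pad_zero(r$lengths, n_slots)))
colnames(s_mat)   <- paste0("s",   seq_len(n_slots))
colnames(dur_mat) <- paste0("dur", seq_len(n_slots))

# One row per individual, sorted by id to match the order used by split()
ids <- unique(dat[c("id", "cohort")])
ids <- ids[order(ids$id), ]

spell_wide <- data.frame(ids, s_mat, dur_mat, row.names = NULL)
spell_wide
```

For the example data this should reproduce the table requested in the edit, for instance states 2, 1, 4 with durations 2, 1, 1 for id 1.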
