
Splitting the sequence of values of a time-varying variable, conditionally on id

In a data management step of my analysis I ran into the following problem.

In practice, each id is recorded up to 5 times, and I have a time-varying variable of interest, tv, which takes the values 1, 2, 3, 4. Suppose my data are:

dat <- read.table(text = "
        id      tv
        1       2
        1       2
        1       1
        1       4
        2       4
        2       1
        2       4
        3       1
        3       2
        3       3
        3       3
        3       2",
    header = TRUE)

What I need to do is create two new sets of variables from tv, in order to obtain:

   id     tv     tv1   tv2   tv3   tv4   tv5    dur1  dur2  dur3  dur4  dur5 
    1      2      2     1     4     0     0       2     1     1     0     0
    1      2      2     1     4     0     0       2     1     1     0     0
    1      1      2     1     4     0     0       2     1     1     0     0
    1      4      2     1     4     0     0       2     1     1     0     0
    2      4      4     1     4     0     0       1     1     1     0     0
    2      1      4     1     4     0     0       1     1     1     0     0
    2      4      4     1     4     0     0       1     1     1     0     0
    3      1      1     2     3     2     0       1     1     2     1     0
    3      2      1     2     3     2     0       1     1     2     1     0
    3      3      1     2     3     2     0       1     1     2     1     0
    3      3      1     2     3     2     0       1     1     2     1     0
    3      2      1     2     3     2     0       1     1     2     1     0

For each id, tv1 - tv5 hold the ordered sequence of tv values with consecutive repeats collapsed (so a value can reappear later, as 2 does for id 3), while dur1 - dur5 hold the number of consecutive records in each of those runs in the original dataset dat. Unused positions are filled with 0.

I really don't know how to proceed here. Any help will be greatly appreciated.

This should do it:

library(plyr)
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 
         3L, 3L), tv = c(2L, 2L, 1L, 4L, 4L, 1L, 4L, 1L, 2L, 3L, 3L, 2L
         )), .Names = c("id", "tv"), class = "data.frame", row.names = c(NA, 
         -12L))

out <- ddply(dat, .(id), function(x) {
    # Run-length encode this id's tv sequence
    this.rle <- rle(x$tv)

    # Pad the run values with zeroes to length 5, then repeat them on every row
    val <- this.rle$values
    val <- c(val, rep(0, 5 - length(val)))
    val <- matrix(rep(val, nrow(x)), byrow = TRUE, nrow = nrow(x))
    val <- as.data.frame(val)
    names(val) <- paste("tv", 1:5, sep = "")

    # Do the same for the run lengths
    len <- this.rle$lengths
    len <- c(len, rep(0, 5 - length(len)))
    len <- matrix(rep(len, nrow(x)), byrow = TRUE, nrow = nrow(x))
    len <- as.data.frame(len)
    names(len) <- paste("dur", 1:5, sep = "")
    cbind(data.frame(tv = x$tv), val, len)
})

> out
   id tv tv1 tv2 tv3 tv4 tv5 dur1 dur2 dur3 dur4 dur5
1   1  2   2   1   4   0   0    2    1    1    0    0
2   1  2   2   1   4   0   0    2    1    1    0    0
3   1  1   2   1   4   0   0    2    1    1    0    0
4   1  4   2   1   4   0   0    2    1    1    0    0
5   2  4   4   1   4   0   0    1    1    1    0    0
6   2  1   4   1   4   0   0    1    1    1    0    0
7   2  4   4   1   4   0   0    1    1    1    0    0
8   3  1   1   2   3   2   0    1    1    2    1    0
9   3  2   1   2   3   2   0    1    1    2    1    0
10  3  3   1   2   3   2   0    1    1    2    1    0
11  3  3   1   2   3   2   0    1    1    2    1    0
12  3  2   1   2   3   2   0    1    1    2    1    0

Here's a solution entirely in base R. It is very similar to @Arun's answer, but will likely be faster than using "plyr":

out <- cbind(dat, do.call(
    rbind, 
    lapply(split(dat$tv, dat$id), function(x) {
        OUT <- matrix(0, ncol = 10, nrow = 1)
        T1 <- rle(x)
        OUT[1, seq_along(T1$values)] <- T1$values
        OUT[1, 6:(5+length(T1$lengths))] <- T1$lengths
        colnames(OUT) <- paste(rep(c("tv", "dur"), 
                                   each = 5), 1:5, sep ="")
        OUT[rep(1, length(x)), ]
    })))
out
#    id tv tv1 tv2 tv3 tv4 tv5 dur1 dur2 dur3 dur4 dur5
# 1   1  2   2   1   4   0   0    2    1    1    0    0
# 2   1  2   2   1   4   0   0    2    1    1    0    0
# 3   1  1   2   1   4   0   0    2    1    1    0    0
# 4   1  4   2   1   4   0   0    2    1    1    0    0
# 5   2  4   4   1   4   0   0    1    1    1    0    0
# 6   2  1   4   1   4   0   0    1    1    1    0    0
# 7   2  4   4   1   4   0   0    1    1    1    0    0
# 8   3  1   1   2   3   2   0    1    1    2    1    0
# 9   3  2   1   2   3   2   0    1    1    2    1    0
# 10  3  3   1   2   3   2   0    1    1    2    1    0
# 11  3  3   1   2   3   2   0    1    1    2    1    0
# 12  3  2   1   2   3   2   0    1    1    2    1    0

Here's a summary of what's happening:

  1. split(dat$tv, dat$id) creates a list of the values in "tv" for each "id".

  2. We apply an anonymous function in which we:

    1. Create a one-row matrix of zeroes. We already know we need 10 columns: five for "tv1" to "tv5" and five for "dur1" to "dur5".
    2. Store the output of rle(), since we need both the "values" and the "lengths".
    3. Use basic subsetting to insert "values" into the first five columns of the matrix and "lengths" into the last five columns.
    4. Add the column names.
    5. Index row 1 repeatedly, via OUT[rep(1, length(x)), ], to "expand" the matrix to as many rows as there are rows in the group.

  3. do.call(rbind, ...) puts all the matrices together, binding them by rows.

  4. cbind(dat, ...) binds the original data.frame to the result from steps 1 to 3.
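
To make the steps above concrete, here is a minimal sketch that runs them by hand for a single group (id 3). The intermediate names groups and expanded are introduced only for illustration and are not part of the answer's code; the data frame is rebuilt inline so the snippet runs on its own.

```r
# Rebuild the example data so the snippet is self-contained
dat <- data.frame(
    id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
    tv = c(2, 2, 1, 4, 4, 1, 4, 1, 2, 3, 3, 2)
)

groups <- split(dat$tv, dat$id)   # step 1: one vector of "tv" per "id"
x <- groups[["3"]]                # the tv values for id 3: 1 2 3 3 2

OUT <- matrix(0, ncol = 10, nrow = 1)             # step 2.1: one row of zeroes
T1 <- rle(x)                                      # step 2.2: runs of equal values
OUT[1, seq_along(T1$values)] <- T1$values         # step 2.3: values in columns 1-5
OUT[1, 5 + seq_along(T1$lengths)] <- T1$lengths   #           lengths in columns 6-10

# step 2.5: asking for row 1 length(x) times returns that row length(x) times
expanded <- OUT[rep(1, length(x)), ]
dim(expanded)
```

One thing to keep in mind: when a group has a single row, the same indexing returns a plain vector rather than a one-row matrix. rbind() still treats it as a row, but adding drop = FALSE to the subsetting call keeps the matrix shape if you need it.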

Again, conceptually, this is very similar to Arun's answer; the use of rle() was probably what you were missing.
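
If rle() is new to you, the short sketch below shows what it returns for the "tv" values of id 3, and why unique() would not be enough here. The variable names x and r are just for illustration.

```r
x <- c(1, 2, 3, 3, 2)   # the "tv" values for id 3

r <- rle(x)
r$values    # the value of each run of consecutive equal elements
r$lengths   # how many elements each run contains

# unique() keeps only the first occurrence of each value, so it would
# lose the second run of 2 that the desired output keeps as tv4
unique(x)

# inverse.rle() rebuilds the original vector from the runs
identical(inverse.rle(r), x)
```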
