
Splitting the sequence of values of a time-varying variable, conditionally on id

In a data management step of my analysis, I ran into the following problem.

In practice, each id is recorded up to 5 times, and I have a time-varying variable of interest, tv, which takes the values 1, 2, 3, 4. Suppose my data are:

dat <- read.table(text = "

        id      tv    
        1       2
        1       2
        1       1
        1       4
        2       4
        2       1
        2       4
        3       1
        3       2
        3       3
        3       3
        3       2", 

    header=TRUE)  

What I need to do is create two new sets of variables from tv, in order to obtain:

   id     tv     tv1   tv2   tv3   tv4   tv5    dur1  dur2  dur3  dur4  dur5 
    1      2      2     1     4     0     0       2     1     1     0     0
    1      2      2     1     4     0     0       2     1     1     0     0
    1      1      2     1     4     0     0       2     1     1     0     0
    1      4      2     1     4     0     0       2     1     1     0     0
    2      4      4     1     4     0     0       1     1     1     0     0
    2      1      4     1     4     0     0       1     1     1     0     0
    2      4      4     1     4     0     0       1     1     1     0     0
    3      1      1     2     3     2     0       1     1     2     1     0
    3      2      1     2     3     2     0       1     1     2     1     0
    3      3      1     2     3     2     0       1     1     2     1     0
    3      3      1     2     3     2     0       1     1     2     1     0
    3      2      1     2     3     2     0       1     1     2     1     0

For each id, tv1 - tv5 hold the ordered sequence of tv values with consecutive repeats collapsed, while dur1 - dur5 hold the number of consecutive records in each of those runs in the original dataset dat. Unused positions are filled with 0.

I really don't know how to proceed here. Any help will be greatly appreciated.

This should do it:

library(plyr)
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 
         3L, 3L), tv = c(2L, 2L, 1L, 4L, 4L, 1L, 4L, 1L, 2L, 3L, 3L, 2L
         )), .Names = c("id", "tv"), class = "data.frame", row.names = c(NA, 
         -12L))

out <- ddply(dat, .(id), function(x) {
    # Run-length encode tv within this id (values and lengths of consecutive runs)
    this.rle <- rle(x$tv)

    # Pad the run values with zeroes to length 5 and repeat them on every row
    val <- this.rle$values
    val <- c(val, rep(0, 5 - length(val)))
    val <- matrix(rep(val, nrow(x)), byrow = TRUE, nrow = nrow(x))
    val <- as.data.frame(val)
    names(val) <- paste("tv", 1:5, sep = "")

    # Same treatment for the run lengths
    len <- this.rle$lengths
    len <- c(len, rep(0, 5 - length(len)))
    len <- matrix(rep(len, nrow(x)), byrow = TRUE, nrow = nrow(x))
    len <- as.data.frame(len)
    names(len) <- paste("dur", 1:5, sep = "")
    cbind(data.frame(tv = x$tv), val, len)
})

> out
   id tv tv1 tv2 tv3 tv4 tv5 dur1 dur2 dur3 dur4 dur5
1   1  2   2   1   4   0   0    2    1    1    0    0
2   1  2   2   1   4   0   0    2    1    1    0    0
3   1  1   2   1   4   0   0    2    1    1    0    0
4   1  4   2   1   4   0   0    2    1    1    0    0
5   2  4   4   1   4   0   0    1    1    1    0    0
6   2  1   4   1   4   0   0    1    1    1    0    0
7   2  4   4   1   4   0   0    1    1    1    0    0
8   3  1   1   2   3   2   0    1    1    2    1    0
9   3  2   1   2   3   2   0    1    1    2    1    0
10  3  3   1   2   3   2   0    1    1    2    1    0
11  3  3   1   2   3   2   0    1    1    2    1    0
12  3  2   1   2   3   2   0    1    1    2    1    0

Here's a solution entirely in base R. It is very similar to @Arun's answer, but will likely be faster than using "plyr":

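# Note: split() returns groups in sorted id order, so the cbind() below
# assumes the rows of dat are already ordered by id.
out <- cbind(dat, do.call(
    rbind, 
    lapply(split(dat$tv, dat$id), function(x) {
        # One row of zeroes: columns 1-5 for values, 6-10 for lengths
        OUT <- matrix(0, ncol = 10, nrow = 1)
        T1 <- rle(x)
        OUT[1, seq_along(T1$values)] <- T1$values
        OUT[1, 6:(5+length(T1$lengths))] <- T1$lengths
        colnames(OUT) <- paste(rep(c("tv", "dur"), 
                                   each = 5), 1:5, sep ="")
        # Repeat the single row once per record; drop = FALSE keeps a matrix
        OUT[rep(1, length(x)), , drop = FALSE]
    })))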
out
#    id tv tv1 tv2 tv3 tv4 tv5 dur1 dur2 dur3 dur4 dur5
# 1   1  2   2   1   4   0   0    2    1    1    0    0
# 2   1  2   2   1   4   0   0    2    1    1    0    0
# 3   1  1   2   1   4   0   0    2    1    1    0    0
# 4   1  4   2   1   4   0   0    2    1    1    0    0
# 5   2  4   4   1   4   0   0    1    1    1    0    0
# 6   2  1   4   1   4   0   0    1    1    1    0    0
# 7   2  4   4   1   4   0   0    1    1    1    0    0
# 8   3  1   1   2   3   2   0    1    1    2    1    0
# 9   3  2   1   2   3   2   0    1    1    2    1    0
# 10  3  3   1   2   3   2   0    1    1    2    1    0
# 11  3  3   1   2   3   2   0    1    1    2    1    0
# 12  3  2   1   2   3   2   0    1    1    2    1    0

Here's a summary of what's happening:

  1. split(dat$tv, dat$id) creates a list of the values in "tv" for each "id".

  2. We apply an anonymous function in which we:

    1. Create a one-row matrix of zeroes. We already know we need 10 columns (five for the values and five for the lengths).
    2. Store the output of rle(), since we need both the "values" and the "lengths".
    3. Use basic subsetting to insert "values" into the first five columns of the matrix and "lengths" into the last five columns.
    4. Add in your column names.
    5. Use a little trick to "expand" the matrix to a specified number of rows: in this case, the same number of rows as there are rows in the group.
  3. do.call(rbind... puts all the matrices together, binding them by rows.

  4. cbind(dat... binds the original data.frame to the result from steps 1 to 3.
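To make the steps above concrete, here is a minimal sketch that walks through them for a single group (id 3) rather than the whole data set. The names groups, one_row, runs and expanded are chosen here for illustration only and do not appear in the answer's code.

```r
# Same example data as in the question
dat <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  tv = c(2, 2, 1, 4, 4, 1, 4, 1, 2, 3, 3, 2)
)

# Step 1: one vector of "tv" values per "id"
groups <- split(dat$tv, dat$id)
x <- groups[["3"]]                      # 1 2 3 3 2

# Step 2.1: one-row matrix of zeroes (5 "tv" columns + 5 "dur" columns)
one_row <- matrix(0, ncol = 10, nrow = 1)

# Step 2.2: run-length encoding of this group
runs <- rle(x)

# Step 2.3: values go into columns 1-5, lengths into columns 6-10
one_row[1, seq_along(runs$values)] <- runs$values
one_row[1, 5 + seq_along(runs$lengths)] <- runs$lengths

# Step 2.4: column names tv1..tv5, dur1..dur5
colnames(one_row) <- paste0(rep(c("tv", "dur"), each = 5), 1:5)

# Step 2.5: indexing row 1 repeatedly copies it once per record in the group
expanded <- one_row[rep(1, length(x)), , drop = FALSE]
dim(expanded)                           # 5 rows, 10 columns
```

The drop = FALSE argument is an addition in this sketch: it keeps the result a matrix even when a group has a single record.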

Again, conceptually, this is very similar to Arun's answer; the use of rle() was probably what you were missing.
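For readers who have not used rle() before, a short illustration on the "tv" values of id 3 may help. It only uses base R's rle() and inverse.rle(); the variable names are chosen here for illustration.

```r
x <- c(1, 2, 3, 3, 2)     # the "tv" values for id 3 in the question

runs <- rle(x)
runs$values               # 1 2 3 2   (what ends up in tv1..tv4)
runs$lengths              # 1 1 2 1   (what ends up in dur1..dur4)

# rle() collapses only consecutive repeats, so the second 2 starts a new run.
# inverse.rle() rebuilds the original vector from the encoding.
restored <- inverse.rle(runs)
```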
