Splitting the sequence of values of a time-varying variable, conditionally on id
In a data management step of my analysis I ran into the following problem. In practice, each id is recorded up to 5 times, and I have a time-varying variable of interest, tv, which takes the values 1, 2, 3, 4. Suppose my data are:
dat <- read.table(text = "
id tv
1 2
1 2
1 1
1 4
2 4
2 1
2 4
3 1
3 2
3 3
3 3
3 2",
header=TRUE)
What I need to do is create two new sets of variables starting from tv, in order to obtain:
id tv tv1 tv2 tv3 tv4 tv5 dur1 dur2 dur3 dur4 dur5
1 2 2 1 4 0 0 2 1 1 0 0
1 2 2 1 4 0 0 2 1 1 0 0
1 1 2 1 4 0 0 2 1 1 0 0
1 4 2 1 4 0 0 2 1 1 0 0
2 4 4 1 4 0 0 1 1 1 0 0
2 1 4 1 4 0 0 1 1 1 0 0
2 4 4 1 4 0 0 1 1 1 0 0
3 1 1 2 3 2 0 1 1 2 1 0
3 2 1 2 3 2 0 1 1 2 1 0
3 3 1 2 3 2 0 1 1 2 1 0
3 3 1 2 3 2 0 1 1 2 1 0
3 2 1 2 3 2 0 1 1 2 1 0
For each id, tv1 - tv5 hold the ordered sequence of distinct records of tv (consecutive repeats collapsed into one), while dur1 - dur5 hold the number of times each of those records is repeated in the original dataset dat. Unused positions are filled with 0.
I really don't know how to proceed here. Any help will be greatly appreciated.
This should do it:
library(plyr)
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L), tv = c(2L, 2L, 1L, 4L, 4L, 1L, 4L, 1L, 2L, 3L, 3L, 2L
)), .Names = c("id", "tv"), class = "data.frame", row.names = c(NA,
-12L))
out <- ddply(dat, .(id), function(x) {
  # Run-length encode tv within this id
  this.rle <- rle(x$tv)
  # Run values: zero-pad to 5 columns and repeat on every row of the id
  val <- this.rle$values
  val <- c(val, rep(0, 5 - length(val)))
  val <- matrix(rep(val, nrow(x)), byrow = TRUE, nrow = nrow(x))
  val <- as.data.frame(val)
  names(val) <- paste("tv", 1:5, sep = "")
  # Run lengths: handled the same way
  len <- this.rle$lengths
  len <- c(len, rep(0, 5 - length(len)))
  len <- matrix(rep(len, nrow(x)), byrow = TRUE, nrow = nrow(x))
  len <- as.data.frame(len)
  names(len) <- paste("dur", 1:5, sep = "")
  cbind(data.frame(tv = x$tv), val, len)
})
> out
id tv tv1 tv2 tv3 tv4 tv5 dur1 dur2 dur3 dur4 dur5
1 1 2 2 1 4 0 0 2 1 1 0 0
2 1 2 2 1 4 0 0 2 1 1 0 0
3 1 1 2 1 4 0 0 2 1 1 0 0
4 1 4 2 1 4 0 0 2 1 1 0 0
5 2 4 4 1 4 0 0 1 1 1 0 0
6 2 1 4 1 4 0 0 1 1 1 0 0
7 2 4 4 1 4 0 0 1 1 1 0 0
8 3 1 1 2 3 2 0 1 1 2 1 0
9 3 2 1 2 3 2 0 1 1 2 1 0
10 3 3 1 2 3 2 0 1 1 2 1 0
11 3 3 1 2 3 2 0 1 1 2 1 0
12 3 2 1 2 3 2 0 1 1 2 1 0
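To see what rle() contributes here, the following is a minimal sketch for a single id (id 3 from the example data). It is not part of the answer above; pad is a hypothetical helper name introduced only for this illustration, and the limit of 5 comes from the question's statement that each id is recorded up to 5 times.

```r
# tv values recorded for id 3 in the example data
x <- c(1, 2, 3, 3, 2)

# rle() collapses consecutive repeats into runs
r <- rle(x)
r$values   # 1 2 3 2  (one entry per run)
r$lengths  # 1 1 2 1  (how long each run lasts)

# Hypothetical helper: zero-pad a vector to n entries (n = 5 is assumed
# from the "up to 5 records per id" limit in the question)
pad <- function(v, n = 5) c(v, rep(0, n - length(v)))

tv_row  <- pad(r$values)   # 1 2 3 2 0
dur_row <- pad(r$lengths)  # 1 1 2 1 0
```

These two padded vectors are the tv1 - tv5 and dur1 - dur5 values shown on every row of id 3 in the output.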
Here's a solution entirely in base R. It is very similar to @Arun's answer, but will likely be faster than using "plyr":
out <- cbind(dat, do.call(
  rbind,
  lapply(split(dat$tv, dat$id), function(x) {
    OUT <- matrix(0, ncol = 10, nrow = 1)
    T1 <- rle(x)
    OUT[1, seq_along(T1$values)] <- T1$values
    OUT[1, 6:(5+length(T1$lengths))] <- T1$lengths
    colnames(OUT) <- paste(rep(c("tv", "dur"),
                               each = 5), 1:5, sep ="")
    OUT[rep(1, length(x)), ]
  })))
out
# id tv tv1 tv2 tv3 tv4 tv5 dur1 dur2 dur3 dur4 dur5
# 1 1 2 2 1 4 0 0 2 1 1 0 0
# 2 1 2 2 1 4 0 0 2 1 1 0 0
# 3 1 1 2 1 4 0 0 2 1 1 0 0
# 4 1 4 2 1 4 0 0 2 1 1 0 0
# 5 2 4 4 1 4 0 0 1 1 1 0 0
# 6 2 1 4 1 4 0 0 1 1 1 0 0
# 7 2 4 4 1 4 0 0 1 1 1 0 0
# 8 3 1 1 2 3 2 0 1 1 2 1 0
# 9 3 2 1 2 3 2 0 1 1 2 1 0
# 10 3 3 1 2 3 2 0 1 1 2 1 0
# 11 3 3 1 2 3 2 0 1 1 2 1 0
# 12 3 2 1 2 3 2 0 1 1 2 1 0
Here's a summary of what's happening:
1. split(dat$tv, dat$id) creates a list of values in "tv" for each "id".
2. We apply an anonymous function in which we store the output of rle(), since we need both the "values" and "lengths".
3. do.call(rbind... puts all the matrices together, binding them by rows.
4. cbind(dat... binds the original data.frame to the result from steps 1 to 3.
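The intermediate objects can be inspected for one id to make the steps concrete. This is only a sketch: groups and block are names made up for the illustration, and drop = FALSE is an addition not present in the code above (it keeps the result a matrix even when an id has a single row).

```r
dat <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
                  tv = c(2, 2, 1, 4, 4, 1, 4, 1, 2, 3, 3, 2))

# Step 1: one vector of tv values per id, named "1", "2", "3"
groups <- split(dat$tv, dat$id)
groups[["2"]]   # 4 1 4

# Step 2, shown for id 2 only: fill a one-row matrix from rle() ...
x   <- groups[["2"]]
T1  <- rle(x)
OUT <- matrix(0, ncol = 10, nrow = 1)
OUT[1, seq_along(T1$values)]      <- T1$values    # columns 1-5: run values
OUT[1, 5 + seq_along(T1$lengths)] <- T1$lengths   # columns 6-10: run lengths

# ... then repeat that single row once per original record of the id.
# drop = FALSE is an assumption added here, not part of the original code.
block <- OUT[rep(1, length(x)), , drop = FALSE]
dim(block)      # 3 10
```

Indexing with rep(1, length(x)) is what makes each id's block have as many rows as the id has records, so that the stacked result lines up with dat in the final cbind().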
Again, conceptually, this is very similar to Arun's answer; the use of rle() was probably what you were missing.