In a data management step of my analysis I ran into the following problem.
Each id is recorded up to 5 times, and I have a time-varying variable of interest, tv, which takes the values 1, 2, 3, 4. Suppose my data are:
dat <- read.table(text = "
id tv
1 2
1 2
1 1
1 4
2 4
2 1
2 4
3 1
3 2
3 3
3 3
3 2",
header=TRUE)
What I need to do is create two new sets of variables starting from tv, in order to obtain:
id tv tv1 tv2 tv3 tv4 tv5 dur1 dur2 dur3 dur4 dur5
1 2 2 1 4 0 0 2 1 1 0 0
1 2 2 1 4 0 0 2 1 1 0 0
1 1 2 1 4 0 0 2 1 1 0 0
1 4 2 1 4 0 0 2 1 1 0 0
2 4 4 1 4 0 0 1 1 1 0 0
2 1 4 1 4 0 0 1 1 1 0 0
2 4 4 1 4 0 0 1 1 1 0 0
3 1 1 2 3 2 0 1 1 2 1 0
3 2 1 2 3 2 0 1 1 2 1 0
3 3 1 2 3 2 0 1 1 2 1 0
3 3 1 2 3 2 0 1 1 2 1 0
3 2 1 2 3 2 0 1 1 2 1 0
For each id, tv1 - tv5 hold the ordered sequence of distinct records of tv (consecutive repeats collapsed into one), while dur1 - dur5 hold the number of times the respective record is repeated in a row in the original dataset dat. Unused positions are filled with 0.
I really don't know how to proceed here. Any help will be greatly appreciated.
This should do it:
library(plyr)
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L), tv = c(2L, 2L, 1L, 4L, 4L, 1L, 4L, 1L, 2L, 3L, 3L, 2L
)), .Names = c("id", "tv"), class = "data.frame", row.names = c(NA,
-12L))
out <- ddply(dat, .(id), function(x) {
  # run-length encode tv within this id
  this.rle <- rle(x$tv)
  # run values, padded with 0 to 5 columns and repeated on every row of this id
  val <- this.rle$values
  val <- c(val, rep(0, 5 - length(val)))
  val <- matrix(rep(val, nrow(x)), byrow = TRUE, nrow = nrow(x))
  val <- as.data.frame(val)
  names(val) <- paste("tv", 1:5, sep = "")
  # run lengths, padded and repeated the same way
  len <- this.rle$lengths
  len <- c(len, rep(0, 5 - length(len)))
  len <- matrix(rep(len, nrow(x)), byrow = TRUE, nrow = nrow(x))
  len <- as.data.frame(len)
  names(len) <- paste("dur", 1:5, sep = "")
  cbind(data.frame(tv = x$tv), val, len)
})
> out
id tv tv1 tv2 tv3 tv4 tv5 dur1 dur2 dur3 dur4 dur5
1 1 2 2 1 4 0 0 2 1 1 0 0
2 1 2 2 1 4 0 0 2 1 1 0 0
3 1 1 2 1 4 0 0 2 1 1 0 0
4 1 4 2 1 4 0 0 2 1 1 0 0
5 2 4 4 1 4 0 0 1 1 1 0 0
6 2 1 4 1 4 0 0 1 1 1 0 0
7 2 4 4 1 4 0 0 1 1 1 0 0
8 3 1 1 2 3 2 0 1 1 2 1 0
9 3 2 1 2 3 2 0 1 1 2 1 0
10 3 3 1 2 3 2 0 1 1 2 1 0
11 3 3 1 2 3 2 0 1 1 2 1 0
12 3 2 1 2 3 2 0 1 1 2 1 0
Here's a solution entirely in base R. It is very similar to @Arun's answer, but will likely be faster than using "plyr":
out <- cbind(dat, do.call(
  rbind,
  lapply(split(dat$tv, dat$id), function(x) {
    # one-row matrix of zeroes: columns 1-5 are tv1-tv5, columns 6-10 are dur1-dur5
    OUT <- matrix(0, ncol = 10, nrow = 1)
    T1 <- rle(x)
    OUT[1, seq_along(T1$values)] <- T1$values
    OUT[1, 6:(5 + length(T1$lengths))] <- T1$lengths
    colnames(OUT) <- paste(rep(c("tv", "dur"),
                               each = 5), 1:5, sep = "")
    # repeat the row once per record of this id
    OUT[rep(1, length(x)), ]
  })))
out
# id tv tv1 tv2 tv3 tv4 tv5 dur1 dur2 dur3 dur4 dur5
# 1 1 2 2 1 4 0 0 2 1 1 0 0
# 2 1 2 2 1 4 0 0 2 1 1 0 0
# 3 1 1 2 1 4 0 0 2 1 1 0 0
# 4 1 4 2 1 4 0 0 2 1 1 0 0
# 5 2 4 4 1 4 0 0 1 1 1 0 0
# 6 2 1 4 1 4 0 0 1 1 1 0 0
# 7 2 4 4 1 4 0 0 1 1 1 0 0
# 8 3 1 1 2 3 2 0 1 1 2 1 0
# 9 3 2 1 2 3 2 0 1 1 2 1 0
# 10 3 3 1 2 3 2 0 1 1 2 1 0
# 11 3 3 1 2 3 2 0 1 1 2 1 0
# 12 3 2 1 2 3 2 0 1 1 2 1 0
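Here's a summary of what's happening:
1. split(dat$tv, dat$id) creates a list of values in "tv" for each "id".
2. We apply an anonymous function in which we:
   - create a one-row matrix of zeroes with 10 columns,
   - use rle(), since we need both the "values" and "lengths",
   - write the values into the first columns and the lengths into columns 6 onward,
   - repeat that row once for each record of the "id".
3. do.call(rbind... puts all the matrices together, binding them by rows.
4. cbind(dat... binds the original data.frame to the result from steps 1 to 3.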
Again, conceptually, this is very similar to Arun's answer--the use of rle() was probably what you were missing.
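To make the role of rle() concrete, here is a minimal sketch of what it returns for the "tv" values of id 1 from the question, and how padding with zeroes gives the tv1 - tv5 and dur1 - dur5 rows. The names x, r, tv_row and dur_row are only illustrative and do not appear in either answer.

```r
# "tv" values recorded for id 1 in the example data
x <- c(2, 2, 1, 4)

# rle() collapses consecutive repeats into runs
r <- rle(x)
r$values   # value of each run
r$lengths  # how many times in a row each value occurs

# pad both vectors with zeroes up to 5 positions
tv_row  <- c(r$values,  rep(0, 5 - length(r$values)))
dur_row <- c(r$lengths, rep(0, 5 - length(r$lengths)))
tv_row
dur_row
```

Note that rle() counts consecutive repeats only, which is why id 3 (1, 2, 3, 3, 2) yields the value 2 twice. The padding step assumes at most 5 runs per id, as stated in the question; rep() would fail with a negative count otherwise.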