
Coercing data frame rows to a matrix in R

I'm unsure of better terminology for my question, so forgive me for the long-winded approach.

I'm trying to use two identifying variables, id and duration, to fill the rows of a matrix in which the columns denote half-hour periods (so there should be 6 for a 3-hour period) and each row holds a given person's activities in those periods. If a person's activities do not fill the row, a dummy value should be used instead. I've written an example below, which should help clarify.

Example: data has 3 columns, id, activity, and duration. id and duration should serve as identifying variables and activity should serve as the variable in the matrix.

data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3), 
               activity = c("a", "b", "c", "d", "e", "b", "b", "a"), 
               duration = c(60, 30, 90, 45, 30, 15, 60, 100))

For the example, I used a 3-hour duration, hence the 6 columns in the matrix. The matrix below is the wanted output. DUMMY appears where the total duration of a person's activities is less than the duration covered by the matrix. In this example, the total duration is 180 minutes (3 hours * 60), so person 2, whose activity durations sum to 75 (45 + 30), gets DUMMY once the activities for the first 75 minutes are done.

mat <- t(matrix(c("a", "a", "b", "c", "c", "c",
            "d", "d", "e", "DUMMY", "DUMMY", "DUMMY",
            "b", "b", "b", "a", "a", "a"), 
          nrow = 6, ncol = 3))
colnames(mat) <- c("0", "30", "60", "90", "120", "150")
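
The per-person totals mentioned above can be checked directly. This is only a quick sketch on the example data; the name totals is introduced here for illustration.

```r
# Example data from the question
data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3),
                   activity = c("a", "b", "c", "d", "e", "b", "b", "a"),
                   duration = c(60, 30, 90, 45, 30, 15, 60, 100))

# Total minutes of recorded activity per person
# (person 1: 180, person 2: 75, person 3: 175)
totals <- tapply(data$duration, data$id, sum)
totals

# Minutes left uncovered in the 180-minute window
180 - totals
```

Person 3 is also 5 minutes short of 180, yet the wanted output still shows a in the last column, which suggests that each column is labeled by the activity in progress at the start of its half hour.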

I'm unsure how to fill the matrix mat above with the data above. Any help would be appreciated. Please let me know if the question needs to be made clearer.

EDIT: edited output

EDIT2: Added matrix column names

EDIT3: Added info on dummy variable

EDIT4: Would it be easier if I added start and end time instead of duration?

One approach is to find, for each "id", the activity in progress at the start of every 30-minute interval:

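# Start of each half-hour slot, in minutes: 0, 30, ..., 150
ints <- seq(0, by = 30, length.out = 6)

data2 <- do.call(rbind,
            lapply(split(data, data$id),
                   function(d) {
                      dur <- d$duration
                      # Activity boundaries in minutes: 0, then the cumulative end times
                      breaks <- c(0, cumsum(dur))
                      # Index of the activity in progress at each slot start;
                      # slots after the last activity index past the end and yield NA
                      i <- findInterval(ints, breaks)
                      data.frame(id = d$id[1], ints = ints, activity = d$activity[i])
                    }))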
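
To see what the findInterval() step does, it may help to run it for a single person. Below is a small sketch using person 2's rows from the question (durations 45 and 30, activities d and e); breaks here simply holds the boundary vector passed to findInterval().

```r
# Person 2 in the question's example data
dur <- c(45, 30)
act <- c("d", "e")

# Start of each half-hour slot, in minutes
ints <- seq(0, by = 30, length.out = 6)

# Activity boundaries in minutes: 0, 45, 75
breaks <- c(0, cumsum(dur))

# For each slot start, the index of the activity in progress.
# An index greater than length(dur) means every activity has already ended.
i <- findInterval(ints, breaks)
i
# [1] 1 1 2 3 3 3

# Indexing past the end of act yields NA for the uncovered slots
act[i]
```

Slots 0 and 30 fall in the first activity, slot 60 in the second, and the remaining three slots start after minute 75, so they map to index 3 and come back as NA.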

Then reshape the new "data.frame" into an id-by-interval matrix:

tapply(as.character(data2$activity), data2[c("id", "ints")], identity)
#   ints
#id  0   30  60  90  120 150
#  1 "a" "a" "b" "c" "c" "c"
#  2 "d" "d" "e" NA  NA  NA 
#  3 "b" "b" "b" "a" "a" "a"
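
The result above has NA where the question asks for DUMMY. Here is a minimal sketch of one way to fill those in, assuming NA can only come from slots that start after a person's last activity. It is a variant rather than the exact code above: it builds the matrix directly with rbind() instead of going through tapply(), and the names rows and res are illustrative.

```r
# Example data from the question
data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3),
                   activity = c("a", "b", "c", "d", "e", "b", "b", "a"),
                   duration = c(60, 30, 90, 45, 30, 15, 60, 100),
                   stringsAsFactors = FALSE)

# Start of each half-hour slot, in minutes
ints <- seq(0, by = 30, length.out = 6)

# One character vector per person: the activity in progress at each slot start
rows <- lapply(split(data, data$id), function(d) {
  d$activity[findInterval(ints, c(0, cumsum(d$duration)))]
})

# Stack the per-person vectors into a 3 x 6 character matrix
res <- do.call(rbind, rows)
colnames(res) <- ints

# Slots after a person's last activity are NA; label them "DUMMY"
res[is.na(res)] <- "DUMMY"
res
```

Apart from the row names (the ids), res should match the mat given in the question.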
