
Coercing data frame rows to matrix in R

I'm unsure of better terminology for my question, so forgive the long-winded approach.

I'm trying to use two identifying variables, id and duration, to fill the rows of a matrix in which the columns denote half-hour periods (so there should be 6 for a 3-hour period) and each row holds a given person's activities in those periods. If the activities do not fill the row, a dummy value should be used instead. I've written an example below, which should help clarify.

Example: data has 3 columns, id, activity, and duration. id and duration should serve as identifying variables, and activity should serve as the variable in the matrix.

data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3), 
               activity = c("a", "b", "c", "d", "e", "b", "b", "a"), 
               duration = c(60, 30, 90, 45, 30, 15, 60, 100))

For the example, I used a 3-hour duration, hence the 6 columns in the matrix. The matrix below is the wanted output. DUMMY entries appear where the total duration of a person's activities does not add up to the duration covered by the matrix. In this example, the total duration is 180 minutes (3 hours * 60), so person 2, whose activity durations sum to 75 (45 + 30), gets the DUMMY value once the activities for the first 75 minutes are done.

mat <- t(matrix(c("a", "a", "b", "c", "c", "c",
            "d", "d", "e", "DUMMY", "DUMMY", "DUMMY",
            "b", "b", "b", "a", "a", "a"), 
          nrow = 6, ncol = 3))
colnames(mat) <- c("0", "30", "60", "90", "120", "150")

I'm unsure how to fill the matrix mat above from the data above. Any help would be appreciated. Please let me know if the question needs to be made clearer.

EDIT: edited output

EDIT2: Added matrix column names

EDIT3: Added info on dummy variable

EDIT4: Would it be easier if I added start and end time instead of duration?

One approach is to locate the activity for every 30-minute interval by "id":

ints = seq(0, by = 30, length.out = 6)

data2 = do.call(rbind, 
            lapply(split(data, data$id),
                   function(d) {
                      dur = d$duration
                      i = findInterval(ints, c(cumsum(c(0, dur[-length(dur)])), sum(dur))) 
                      data.frame(id = d$id[1], ints = ints, activity = d$activity[i])
                    }))
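To see what the findInterval() step does, here is a small sketch for a single person (id 2 from the example data). The names dur, activity and breaks are mine, chosen for illustration. I've also written the break points as cumsum(c(0, dur)), which should give the same values as the longer expression above.

```r
# Durations and activities for id 2 in the example data.
dur <- c(45, 30)
activity <- c("d", "e")

# Start of each half-hour slot: 0, 30, 60, 90, 120, 150.
ints <- seq(0, by = 30, length.out = 6)

# Start time of each activity plus the end of the last one: 0, 45, 75.
breaks <- cumsum(c(0, dur))

# For each slot start, the index of the activity running at that time.
# Slots at or after minute 75 get index 3, which is past the last activity.
i <- findInterval(ints, breaks)
i            # 1 1 2 3 3 3
activity[i]  # "d" "d" "e" NA NA NA
```

Indexing past the end of activity is what produces the NA entries for slots with no recorded activity.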

And on the new "data.frame":

tapply(as.character(data2$activity), data2[c("id", "ints")], identity)
#   ints
#id  0   30  60  90  120 150
#  1 "a" "a" "b" "c" "c" "c"
#  2 "d" "d" "e" NA  NA  NA 
#  3 "b" "b" "b" "a" "a" "a"
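The result above has NA where the question asks for "DUMMY". Below is a minimal sketch of one way to get a plain character matrix with the placeholder filled in. It is a variation rather than the code above: it binds the per-id rows with rbind() instead of going through tapply(), and slots is a name I made up. It assumes the activities in each id are listed in time order, as in the example.

```r
# Example data from the question; keep activity as character, not factor.
data <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3),
                   activity = c("a", "b", "c", "d", "e", "b", "b", "a"),
                   duration = c(60, 30, 90, 45, 30, 15, 60, 100),
                   stringsAsFactors = FALSE)

# Start of each half-hour slot in the 3-hour window.
ints <- seq(0, by = 30, length.out = 6)

# One character vector of length 6 per id: the activity running at each slot.
slots <- lapply(split(data, data$id), function(d) {
  i <- findInterval(ints, cumsum(c(0, d$duration)))
  d$activity[i]
})

# Stack the vectors into a 3 x 6 matrix; row names are the ids.
mat <- do.call(rbind, slots)
colnames(mat) <- as.character(ints)

# Slots with no recorded activity come out as NA; replace them.
mat[is.na(mat)] <- "DUMMY"
mat
```

If the tapply() result is preferred, the same is.na() replacement should work on it as well, since it is also a character matrix.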
