Getting the row.name after grouping with data.table

Question

I'm new to data.table, but I've managed to reduce a computation on a dataset of 600K rows from thousands of seconds (using *ply loops) to 1.7 sec. Basically, I need the row with the lowest value in the column class within each combination of group and start. I'm using:

DT[, list(class=min(class)), by=list(group, start)]

But to do that I created DT with only these 3 columns from a data.frame that has more columns. To merge my results back into the original data.frame I'm thinking of using the row.name, so I created DT with row.name=TRUE. This is an example of what I have:

   group   start     class     rn
 1:  A      4943         4      1
 2:  A      5030         0      2
 3:  A      5030         4      3
 4:  A      5030         2      4
 5:  A      5083         4      5
 6:  A      5083         3      6
 7:  B      5041         0      7
 8:  B      5041         1      8
 9:  B      5083         4      9
 ...

My desired result is only the rn corresponding to the minimum class value:

   group   start     class     rn
 1:  A      4943         4      1
 2:  A      5030         0      2
 3:  A      5083         3      6
 4:  B      5041         0      7
 5:  B      5083         4      9
 ...

But if I use:

DT[, list(class=min(class)), by=list(group, start, rn)]

or

DT[, list(class=min(class), rn), by=list(group, start)]

I get all the rows, not only the rows with the minimum class.

Extra question

Would it be possible to get a count of the cases of each class type in the group, using data.table syntax together with my command?

   group   start     class     rn    class0    class1    class2    class3    class4
 1:  A      4943         4      1         0         0         0         0         1
 2:  A      5030         0      2         1         0         1         0         1
 3:  A      5083         3      6         0         0         0         1         1
 4:  B      5041         0      7         1         1         0         0         0
 5:  B      5083         4      9         0         0         0         0         1
 ...

Answer 1

For your first question, you're basically calling min on each group. This is not necessary. If you sort the column class as well (by setting the key), then you can use the mult="first" feature to pick the smallest element directly. That is,

setkey(dt, group, start, class)
dt[CJ(unique(group), unique(start)), mult="first", nomatch=0]
   group start class rn
1:     A  4943     4  1
2:     A  5030     0  2
3:     A  5083     3  6
4:     B  5041     0  7
5:     B  5083     4  9
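
To see why the keyed join works, here is a minimal, self-contained sketch built from the nine rows shown in the question (the elided rows are not reproduced). It swaps CJ for the distinct (group, start) pairs that actually occur in the table; that is my own variation, not part of the answer above. CJ builds every combination of group and start, including pairs that never occur together, which is why the call above needs nomatch=0.

```r
library(data.table)

# The nine sample rows from the question; rn is the original row number.
dt <- data.table(
  group = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  start = c(4943L, 5030L, 5030L, 5030L, 5083L, 5083L, 5041L, 5041L, 5083L),
  class = c(4L, 0L, 4L, 2L, 4L, 3L, 0L, 1L, 4L),
  rn    = 1:9
)

# Sorting by class within (group, start) puts the minimum class first.
setkey(dt, group, start, class)

# Distinct (group, start) pairs that actually occur (assumed replacement for CJ).
pairs <- unique(dt[, .(group, start)])

# For each pair, keep only the first matching row, i.e. the lowest class.
res <- dt[pairs, on = .(group, start), mult = "first"]
print(res)
```

Because every pair in pairs exists in dt, no nomatch argument should be needed in this variant.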

Alternatively, if you don't want to use CJ here, you can do this:

setkey(dt, group, start, class)
dt[, list(class=class[1], rn=rn[1]), by=list(group, start)]
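
Since the goal in the question is to get back to the full data.frame, another common idiom is to collect row positions with .I and which.min, then use them to subset the original directly. This is only a sketch, not the method used in this answer; the data.frame df and its extra column are hypothetical stand-ins for the asker's wider data.

```r
library(data.table)

# Hypothetical original data.frame: the question's nine rows plus one made-up column.
df <- data.frame(
  group = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  start = c(4943L, 5030L, 5030L, 5030L, 5083L, 5083L, 5041L, 5041L, 5083L),
  class = c(4L, 0L, 4L, 2L, 4L, 3L, 0L, 1L, 4L),
  extra = letters[1:9]
)

# Work on the three relevant columns only, in the same row order as df.
dt <- as.data.table(df[, c("group", "start", "class")])

# .I holds row positions in dt; which.min() picks the first minimum per group.
idx <- dt[, .I[which.min(class)], by = .(group, start)]$V1

# dt and df share row order, so idx indexes df directly.
res <- df[idx, ]
print(res)
```

This avoids both the key and a separate rn column, at the cost of relying on dt keeping the same row order as df.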

Edit 2:

Here's a complete answer:

dt.out <- dt[, c(list(class = class[1], rn=rn[1]), 
       {tt <- rep(0,5); tt[class+1] <- 1; as.list(tt)}), by=list(group, start)]
setnames(dt.out, 5:9, paste0("Class", 0:4))

   group start class rn Class0 Class1 Class2 Class3 Class4
1:     A  4943     4  1      0      0      0      0      1
2:     A  5030     0  2      1      0      1      0      1
3:     A  5083     3  6      0      0      0      1      1
4:     B  5041     0  7      1      1      0      0      0
5:     B  5083     4  9      0      0      0      0      1
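
Note that the code above marks whether each class occurs in the group (0 or 1) rather than how often. If actual counts are wanted, as the wording of the question suggests, one option is tabulate(), sketched here on the question's nine rows. On this sample no class value repeats within a group, so the counts happen to coincide with the indicators. The sketch assumes class only takes the integer values 0 to 4; the classN column names follow the question's desired output.

```r
library(data.table)

# The nine sample rows from the question.
dt <- data.table(
  group = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  start = c(4943L, 5030L, 5030L, 5030L, 5083L, 5083L, 5041L, 5041L, 5083L),
  class = c(4L, 0L, 4L, 2L, 4L, 3L, 0L, 1L, 4L),
  rn    = 1:9
)

# tabulate() counts occurrences of 1..nbins, so shift class (0..4) up by one.
counts <- dt[,
  setNames(as.list(tabulate(class + 1L, nbins = 5L)), paste0("class", 0:4)),
  by = .(group, start)
]
print(counts)
```

The result has one row per (group, start) and can be joined to the minimum-class rows on those two columns.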

Solution 1: 2, accepted, 2013-07-22 13:51:04
