Get row.name after grouping with data.table

I'm new to data.table, but I've managed to cut a computation on a 600K-row dataset from thousands of seconds (using *ply loops) to 1.7 seconds. Basically, within each combination of group and start, I need the row with the lowest value in the column class. I'm using:

DT[, list(class=min(class)), by=list(group, start)]

But to do that I created DT with only these 3 columns, taken from a data.frame that has more columns. To merge my results back into the original data.frame I'm thinking of using the row.name, so I created DT with row.name=TRUE, which gives the rn column. This is an example of what I have:

   group   start     class     rn
 1:  A      4943         4      1
 2:  A      5030         0      2
 3:  A      5030         4      3
 4:  A      5030         2      4
 5:  A      5083         4      5
 6:  A      5083         3      6
 7:  B      5041         0      7
 8:  B      5041         1      8
 9:  B      5083         4      9
 ...

My desired result is only the rn corresponding to the minimum class value:

   group   start     class     rn
 1:  A      4943         4      1
 2:  A      5030         0      2
 3:  A      5083         3      6
 4:  B      5041         0      7
 5:  B      5083         4      9
 ...

But if I use:

DT[, list(class=min(class)), by=list(group, start, rn)]

or

DT[, list(class=min(class), rn), by=list(group, start)]

I get all the rows, not only the rows with the minimum class.

Extra question

Would it be possible to get a count of the cases of each class type in the group, using data.table syntax with my command?

   group   start     class     rn    class0    class1    class2    class3    class4
 1:  A      4943         4      1         0         0         0         0         1
 2:  A      5030         0      2         1         0         1         0         1
 3:  A      5083         3      6         0         0         0         1         1
 4:  B      5041         0      7         1         1         0         0         0
 5:  B      5083         4      9         0         0         0         0         1
 ...

For your first question, you're basically calling min on each group. This is not necessary. If you sort by the column class as well (by setting the key), then you can use the mult="first" feature to pick the smallest element directly. That is:

setkey(dt, group, start, class)
dt[CJ(unique(group), unique(start)), mult="first", nomatch=0]
   group start class rn
1:     A  4943     4  1
2:     A  5030     0  2
3:     A  5083     3  6
4:     B  5041     0  7
5:     B  5083     4  9
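
To reproduce the keyed join above, here is a minimal sketch that rebuilds the nine sample rows from the question. The table contents are copied from the question; using a plain integer rn column in place of real row names is an assumption made for illustration.

```r
library(data.table)

# Sample rows copied from the question; rn stands in for the original row names.
dt <- data.table(
  group = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  start = c(4943, 5030, 5030, 5030, 5083, 5083, 5041, 5041, 5083),
  class = c(4, 0, 4, 2, 4, 3, 0, 1, 4),
  rn    = 1:9
)

# Sorting by all three columns puts the smallest class first
# within each (group, start) combination.
setkey(dt, group, start, class)

# CJ() builds every combination of group and start. nomatch = 0 drops the
# combinations that never occur in dt (for example B with 4943), and
# mult = "first" keeps only the first matching row, i.e. the minimum class.
res <- dt[CJ(unique(group), unique(start)), mult = "first", nomatch = 0]
print(res)
```

The join only matches on the first two key columns (group and start); the third key column, class, is there purely to fix the order inside each match.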

Alternatively, if you don't want to use CJ here, you can do this:

setkey(dt, group, start, class)
dt[, list(class=class[1], rn=rn[1]), by=list(group, start)]
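
If setting a key is not wanted, a common alternative (not part of the answer above) is to collect row numbers with .I and which.min(). The sketch below also shows one way the saved row names could be used to pull the full rows back out of the original data.frame. The data.frame df and its extra column are hypothetical, invented only to stand in for the "more columns" mentioned in the question.

```r
library(data.table)

# Hypothetical original data.frame: the question's sample rows plus one extra column.
df <- data.frame(
  group = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  start = c(4943, 5030, 5030, 5030, 5083, 5083, 5041, 5041, 5083),
  class = c(4, 0, 4, 2, 4, 3, 0, 1, 4),
  extra = letters[1:9]
)

# keep.rownames = TRUE stores the row names in a character column named "rn".
DT <- as.data.table(df, keep.rownames = TRUE)

# .I holds the row numbers of DT; which.min() returns the position of the
# first minimum, so each group contributes exactly one row number.
idx <- DT[, .I[which.min(class)], by = list(group, start)]$V1
picked <- DT[idx]

# Index the original data.frame by the saved row names to get all its columns.
full <- df[picked$rn, ]
print(full)
```

When several rows in a group share the minimum class, which.min() keeps only the first of them, which matches the one-row-per-group result asked for in the question.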

Edit 2:

Here's a complete answer:

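# Assumes dt is keyed by group, start, class, so class[1] is the group minimum.
dt.out <- dt[, c(list(class = class[1], rn=rn[1]), 
       # one 0/1 flag per class value 0..4: 1 if that class occurs in the group
       {tt <- rep(0,5); tt[class+1] <- 1; as.list(tt)}), by=list(group, start)]
# the five flag columns come out unnamed, so rename them by position
setnames(dt.out, 5:9, paste0("Class", 0:4))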

   group start class rn Class0 Class1 Class2 Class3 Class4
1:     A  4943     4  1      0      0      0      0      1
2:     A  5030     0  2      1      0      1      0      1
3:     A  5083     3  6      0      0      0      1      1
4:     B  5041     0  7      1      1      0      0      0
5:     B  5083     4  9      0      0      0      0      1
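
Note that the columns above are 0/1 flags (does the class occur in the group), not counts. If true counts are wanted, as the extra question literally asks, one possible variation is to swap the flag vector for base R's tabulate(). This is a sketch under the same assumptions as the answer: class takes integer values 0 to 4 and dt is keyed by group, start, class. The name counts is made up for illustration.

```r
library(data.table)

# Sample rows copied from the question.
dt <- data.table(
  group = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  start = c(4943, 5030, 5030, 5030, 5083, 5083, 5041, 5041, 5083),
  class = c(4, 0, 4, 2, 4, 3, 0, 1, 4),
  rn    = 1:9
)
setkey(dt, group, start, class)

# tabulate(class + 1, nbins = 5) counts how often each of the values 0..4
# occurs in the group; the + 1 shifts class 0 into bin 1.
counts <- dt[, c(list(class = class[1], rn = rn[1]),
                 as.list(tabulate(class + 1, nbins = 5))),
             by = list(group, start)]
setnames(counts, 5:9, paste0("Class", 0:4))
print(counts)
```

With these sample rows every class appears at most once per group, so the result matches the table above; with repeated class values the columns would hold counts greater than 1.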
