简体   繁体   中英

Get row.name after grouping with data.table

I'm new with data.table but I've managed to reduce a computation in a dataset of 600K rows from thousands of seconds (using *ply loops) to 1.7sec. Basically I need the row with the lowest value in the column class in the groups of the same group and start . I'm using

DT[, list(class=min(class)), by=list(group, start)]

But to do that I created DT with only these 3 columns from a data.frame with more columns. So, to merge my results with the original data.frame I'm thinking of using the row.name, so I created DT with row.name=TRUE and this is an example of what I have:

   group   start     class     rn
 1:  A      4943         4      1
 2:  A      5030         0      2
 3:  A      5030         4      3
 4:  A      5030         2      4
 5:  A      5083         4      5
 6:  A      5083         3      6
 7:  B      5041         0      7
 8:  B      5041         1      8
 9:  B      5083         4      9
 ...

My desired result is only the rn corresponding to the minimium class value:

   group   start     class     rn
 1:  A      4943         4      1
 2:  A      5030         0      2
 3:  A      5083         3      6
 4:  B      5041         0      7
 5:  B      5083         4      9
 ...

But if I use:

DT[, list(class=min(class)), by=list(group, start, rn)]

or

DT[, list(class=min(class), rn), by=list(group, start)]

I get all the rows, not only the rows with class minimum.

Extra question

I'd be possible to get a count of the cases of each class type in the group using data.table sintax using my command?

   group   start     class     rn    class0    class1    class2    class3    class4
 1:  A      4943         4      1         0         0         0         0         1
 2:  A      5030         0      2         1         0         1         0         1
 3:  A      5083         3      6         0         0         0         1         1
 4:  B      5041         0      7         1         1         0         0         0
 5:  B      5083         4      9         0         0         0         0         1
 ...

For your first question, you're basically calling min on each group. This is not necessary. If you sort the column class as well (by setting the key ), then you can use mult="first" feature to just pick the smallest element directly. That is,

setkey(dt, group, start, class)
dt[CJ(unique(group), unique(start)), mult="first", nomatch=0]
   group start class rn
1:     A  4943     4  1
2:     A  5030     0  2
3:     A  5083     3  6
4:     B  5041     0  7
5:     B  5083     4  9

Alternatively if you don't want to use CJ here, then you can do this:

setkey(dt, group, start, class)
dt[, list(class=class[1], rn=rn[1]), by=list(group, start)]

Edit 2:

Here's a complete answer:

dt.out <- dt[, c(list(class = class[1], rn=rn[1]), 
       {tt <- rep(0,5); tt[class+1] <- 1; as.list(tt)}), by=list(group, start)]
setnames(dt.out, 5:9, paste0("Class", 0:4))

   group start class rn Class0 Class1 Class2 Class3 Class4
1:     A  4943     4  1      0      0      0      0      1
2:     A  5030     0  2      1      0      1      0      1
3:     A  5083     3  6      0      0      0      1      1
4:     B  5041     0  7      1      1      0      0      0
5:     B  5083     4  9      0      0      0      0      1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM