Get row.name after grouping with data.table
I'm new to data.table, but I've managed to reduce a computation on a dataset of 600K rows from thousands of seconds (using *ply loops) to 1.7 seconds. Basically, for each combination of group and start, I need the row with the lowest value in the column class. I'm using
DT[, list(class=min(class)), by=list(group, start)]
But to do that I created DT with only these 3 columns from a data.frame that has more columns. So, to merge my results with the original data.frame, I'm thinking of using the row names. I therefore created DT with keep.rownames=TRUE (which adds the rn column), and this is an example of what I have:
group start class rn
1: A 4943 4 1
2: A 5030 0 2
3: A 5030 4 3
4: A 5030 2 4
5: A 5083 4 5
6: A 5083 3 6
7: B 5041 0 7
8: B 5041 1 8
9: B 5083 4 9
...
My desired result is only the rn corresponding to the minimum class value in each group:
group start class rn
1: A 4943 4 1
2: A 5030 0 2
3: A 5083 3 6
4: B 5041 0 7
5: B 5083 4 9
...
But if I use:
DT[, list(class=min(class)), by=list(group, start, rn)]
or
DT[, list(class=min(class), rn), by=list(group, start)]
I get all the rows, not only the rows with the minimum class.
Extra question
Would it be possible to get a count of the cases of each class type in the group, using data.table syntax within my command?
group start class rn class0 class1 class2 class3 class4
1: A 4943 4 1 0 0 0 0 1
2: A 5030 0 2 1 0 1 0 1
3: A 5083 3 6 0 0 0 1 1
4: B 5041 0 7 1 1 0 0 0
5: B 5083 4 9 0 0 0 0 1
...
For your first question, you're basically calling min on each group. This is not necessary. If you sort the column class as well (by setting the key), then you can use the mult="first" feature to pick the smallest element directly. That is,
setkey(dt, group, start, class)
dt[CJ(unique(group), unique(start)), mult="first", nomatch=0]
group start class rn
1: A 4943 4 1
2: A 5030 0 2
3: A 5083 3 6
4: B 5041 0 7
5: B 5083 4 9
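To make the keyed join above concrete, here is a minimal sketch on the nine sample rows from the question (typed in by hand, so the data construction is an assumption rather than the asker's actual code). CJ() builds every combination of the unique group and start values, including pairs that never occur in dt. nomatch=0 drops those pairs, and mult="first" keeps the first matching row, which is the minimum class because class is the last key column.

```r
library(data.table)

# Sample rows copied from the question; rn stands in for the original row names.
dt <- data.table(
  group = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  start = c(4943, 5030, 5030, 5030, 5083, 5083, 5041, 5041, 5083),
  class = c(4, 0, 4, 2, 4, 3, 0, 1, 4),
  rn    = 1:9
)

# Key on all three columns so rows are sorted by class within (group, start).
setkey(dt, group, start, class)

# All 2 x 4 combinations of group and start; only 5 of them exist in dt.
combos <- dt[, CJ(unique(group), unique(start))]

# Join on the first two key columns, keep the first match per combination,
# and drop combinations that have no match.
res <- dt[combos, mult = "first", nomatch = 0]
print(res)
```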
Alternatively, if you don't want to use CJ here, then you can do this:
setkey(dt, group, start, class)
dt[, list(class=class[1], rn=rn[1]), by=list(group, start)]
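Since the goal in the question is to get back to the original data.frame, the following sketch shows one way the rn column could be used for that. The data.frame df and its extra column value are hypothetical stand-ins for the asker's real data. Note that keep.rownames=TRUE stores the row names as character, so they are converted to integer before indexing.

```r
library(data.table)

# Hypothetical original data.frame: the three columns from the question plus
# an extra column ("value") standing in for the remaining columns.
df <- data.frame(
  group = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  start = c(4943, 5030, 5030, 5030, 5083, 5083, 5041, 5041, 5083),
  class = c(4, 0, 4, 2, 4, 3, 0, 1, 4),
  value = letters[1:9],
  stringsAsFactors = FALSE
)

# keep.rownames = TRUE stores the row names in a character column "rn".
dt <- as.data.table(df, keep.rownames = TRUE)[, list(group, start, class, rn)]

# Sort by class within (group, start), then take the first row of each group.
setkey(dt, group, start, class)
res <- dt[, list(class = class[1], rn = rn[1]), by = list(group, start)]

# Use rn to pull the complete rows back out of the original data.frame.
picked <- df[as.integer(res$rn), ]
print(picked)
```

This relies on df still having its default row names (1, 2, 3, ...). If the row names are arbitrary strings, indexing with df[res$rn, ] directly would be the closer equivalent.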
Edit 2:
Here's a complete answer:
dt.out <- dt[, c(list(class = class[1], rn=rn[1]),
{tt <- rep(0,5); tt[class+1] <- 1; as.list(tt)}), by=list(group, start)]
setnames(dt.out, 5:9, paste0("Class", 0:4))
group start class rn Class0 Class1 Class2 Class3 Class4
1: A 4943 4 1 0 0 0 0 1
2: A 5030 0 2 1 0 1 0 1
3: A 5083 3 6 0 0 0 1 1
4: B 5041 0 7 1 1 0 0 0
5: B 5083 4 9 0 0 0 0 1
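One thing to be aware of: the code above sets a 0/1 flag per class, so a class that occurs several times within a group would still show 1. If actual counts are wanted, a possible variant (not part of the original answer) is to use base R's tabulate(), sketched here on the sample rows from the question and assuming class only takes the integer values 0 to 4. On these rows every class occurs at most once per group, so the counts coincide with the flags.

```r
library(data.table)

# Sample rows copied from the question, with class stored as integer.
dt <- data.table(
  group = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  start = c(4943, 5030, 5030, 5030, 5083, 5083, 5041, 5041, 5083),
  class = c(4L, 0L, 4L, 2L, 4L, 3L, 0L, 1L, 4L),
  rn    = 1:9
)
setkey(dt, group, start, class)

# tabulate() counts occurrences of the values 1..nbins, so class (0..4)
# is shifted up by one to map onto bins 1..5.
dt.counts <- dt[, c(list(class = class[1], rn = rn[1]),
                    as.list(tabulate(class + 1L, nbins = 5L))),
                by = list(group, start)]

# The five count columns are unnamed in j, so rename them by position.
setnames(dt.counts, 5:9, paste0("Class", 0:4))
print(dt.counts)
```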