By-group calculation, limited to first N rows of each group
I asked a question before and received a good answer, but I need to apply it to a more specific problem. The DT needs to be divided into 16 sectors based on its X and Y values. The X and Y variables represent the coordinates to loop through when dividing the data table. I have successfully divided this data table into 16 different 'sectors', and I need to apply the sCalc function to each sector and output a number. I'm looking for a faster way to do this.
Refer to this link for clarification if needed: Faster way to subset data table instead of a for loop R.
library(data.table)
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000),
                 Norm = rnorm(1600*2000), Unif = runif(1600*2000))

# Sum Norm/sqrt(Unif) over the 2% of rows with the smallest Norm, divided by the row count
sCalc <- function(DT) {
  setkey(DT, Norm)
  cells <- DT[1:(nrow(DT)*0.02)]
  nCells <- nrow(DT)
  sumCell <- sum(cells[, Norm/sqrt(Unif)])
  return(sumCell/nCells)
}

# Start and stop coordinates of the 4 x 4 sectors along one axis
startstop <- function(width, y = FALSE) {
  startend <- width - (width/4 - 1)
  start <- round(seq(0, startend, length.out = 4))
  stop <- round(seq(width/4, width, length.out = 4))
  if (length(c(start,stop)[anyDuplicated(c(start,stop))]) != 0) {
    dup <- anyDuplicated(c(start,stop))
    stop[which(stop == c(start,stop)[dup])] <- stop[which(stop == c(start,stop)[dup])] - 1
  }
  if (y == TRUE) {
    coord <- list(rep(start, each = 4), rep(stop, each = 4))
  } else if (y == FALSE) {
    coord <- list(rep(start, times = 4), rep(stop, times = 4))
  }
  return(coord)
}

# Subset each of the 16 sectors in turn and apply sCalc to it
sectorCalc <- function(x, y, DT) {
  sector <- numeric(length = 16)
  for (i in 1:length(sector)) {
    sect <- DT[X %between% c(x[[1]][i], x[[2]][i]) & Y %between% c(y[[1]][i], y[[2]][i])]
    sector[i] <- sCalc(sect)
  }
  return(sector)
}

x <- startstop(2000)
y <- startstop(1600, y = TRUE)
sectorLoop <- sectorCalc(x, y, DT)
sectorLoop
returns:
-4.729271 -4.769156 -4.974996 -4.931120 -4.777013 -4.644919 -4.958968 -4.663221 -4.771545 -4.909868 -4.821098 -4.795526 -4.846709 -4.931514 -4.875148 -4.847105
One solution was using the cut function.
DT[, x.sect := cut(DT[, X], seq(0, 2000, by = 500), dig.lab=10)]
DT[, y.sect := cut(DT[, Y], seq(0, 1600, by = 400), dig.lab=10)]
sectorRef <- DT[order(Norm), .(sCalc = sum(Norm[1:(0.02*.N)] / sqrt(Unif[1:(0.02*.N)]) )/(0.02*.N)), by = .(x.sect, y.sect)]
sectorRef <- sectorRef[[3]]
The above solution returns the values:
-4.919447 -4.778576 -4.757455 -4.779086 -4.739814 -4.836497 -4.776635 -4.656748 -4.939441 -4.707901 -4.751791 -4.864481 -4.839134 -4.973294 -4.663360 -5.055344
cor(sectorRef, sectorLoop)
The above returns: 0.0726904
As far as I understand the question, the first thing I would explain is that you can use .N to tell you how many rows there are in each by=.(...) group. I think that is analogous to your nCells.
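As a small illustration of that point, the sketch below counts rows per group with .N. The table toy and its columns g and v are made up for this example and are not part of the question's data.

```r
library(data.table)

# Toy table (hypothetical data): three groups of different sizes
toy <- data.table(g = c("a", "a", "a", "b", "b", "c"), v = 1:6)

# .N is evaluated once per group, so it gives each group's row count
counts <- toy[, .(nCells = .N), by = g]
print(counts)
```

Here nCells plays the same role as nrow(DT) does inside sCalc, only computed per by-group instead of per subset.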
And where your cells takes the top 2% of rows in each group, this can be accomplished at the vector level by indexing with [1:(0.02*.N)]. Assuming you want the top 2% in order of increasing Norm (which is the order you would get from setkey(DT, Norm), although setting a key does more than just sorting), you could call setkey(DT, Norm) before the calculation, as in the example, or, to make it clearer what you are doing, you could use order(Norm) inside your calculation.
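To make the indexing concrete, here is a minimal sketch on made-up data (the table toy, the column g and the seed are assumptions for illustration only). It takes the first 2% of each group after sorting by Norm and cross-checks the result against sorting each group by hand.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the toy data is reproducible
toy <- data.table(g = rep(c("a", "b"), each = 100), Norm = rnorm(200))

# order(Norm) in i sorts the rows before grouping; toy itself is not reordered.
# With 100 rows per group, 1:(0.02 * .N) is 1:2, the two smallest Norm values.
top <- toy[order(Norm), .(smallest = Norm[1:(0.02 * .N)]), by = g]

# Cross-check against sorting each group by hand
byHand <- toy[, .(smallest = sort(Norm)[1:2]), by = g]

print(top)
print(byHand)
```

The two tables hold the same values, although the groups may appear in a different order, because grouping after order(Norm) lists groups in the order they first occur in the sorted rows.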
The sum() part doesn't change, so the equivalent third line is:
DT[order(Norm),
.(sCalc = sum( Norm[1:(0.02*.N)] / sqrt(Unif[1:(0.02*.N)]) )/.N),
by = .(x.sect, y.sect)]
which returns the result for the 16 groups:
x.sect y.sect sCalc
1: (1500,2000] (800,1200] -0.09380209
2: (499,1000] (399,800] -0.09833151
3: (499,1000] (1200,1600] -0.09606350
4: (0,499] (399,800] -0.09623751
5: (0,499] (800,1200] -0.09598717
6: (1500,2000] (0,399] -0.09306580
7: (1000,1500] (399,800] -0.09669593
8: (1500,2000] (399,800] -0.09606388
9: (1500,2000] (1200,1600] -0.09368166
10: (499,1000] (0,399] -0.09611643
11: (1000,1500] (0,399] -0.09404482
12: (0,499] (1200,1600] -0.09387951
13: (1000,1500] (1200,1600] -0.10069461
14: (1000,1500] (800,1200] -0.09825285
15: (0,499] (0,399] -0.09890184
16: (499,1000] (800,1200] -0.09756506
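Note that the groups in this output are not sorted by sector, so the rows need to be aligned before they are compared element by element with a loop result. The sketch below checks, on a made-up miniature table (toy, with a hypothetical sect column standing in for the x.sect and y.sect pair), that the grouped expression agrees with applying sCalc to each subset once both are in the same sector order. It is a sanity check under those assumptions, not a benchmark.

```r
library(data.table)

set.seed(42)  # arbitrary seed for a reproducible toy example
# Hypothetical miniature of DT: 4 sectors of 500 rows each
toy <- data.table(sect = rep(1:4, each = 500),
                  Norm = rnorm(2000), Unif = runif(2000))

# The question's sCalc, working on a copy so the caller's table is not re-keyed
sCalc <- function(d) {
  d <- copy(d)
  setkey(d, Norm)
  cells <- d[1:(nrow(d) * 0.02)]
  sum(cells[, Norm / sqrt(Unif)]) / nrow(d)
}

# Loop version: one subset per sector
loopRes <- sapply(1:4, function(s) sCalc(toy[sect == s]))

# Grouped version: same formula, dividing by .N just as sCalc divides by nCells
grouped <- toy[order(Norm),
               .(sCalc = sum(Norm[1:(0.02 * .N)] / sqrt(Unif[1:(0.02 * .N)])) / .N),
               by = sect]

# Groups come back in order of first appearance after sorting by Norm,
# so align them by sector before comparing
setorder(grouped, sect)
print(all.equal(loopRes, grouped$sCalc))
```

On the full data the same alignment idea would apply to the x.sect and y.sect columns, keeping in mind that sector boundaries produced by cut may differ slightly from those produced by startstop.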