
By-group calculation, limited to first N rows of each group

I asked a question before and received a good answer, but I need to apply it to a more specific problem. The data table DT needs to be divided into 16 sectors based on its X and Y values, which are the coordinates used to loop through and divide the table. I have successfully divided the data table into 16 'sectors', and I need to apply the sCalc function to each sector and output one number per sector. I'm looking for a faster way to do this.

Refer to this question for clarification if needed: Faster way to subset data table instead of a for loop R.

library(data.table)
DT <- data.table(X = rep(1:2000, times = 1600), Y = rep(1:1600, each = 2000), Norm = rnorm(1600*2000), Unif = runif(1600*2000))

sCalc <- function(DT) { 
    setkey(DT, Norm) 
    cells <- DT[1:(nrow(DT)*0.02)] 
    nCells <- nrow(DT) 
    sumCell <- sum(cells[,Norm/sqrt(Unif)]) 
    return(sumCell/nCells) 
} 

startstop <- function(width, y = FALSE) {
    startend <- width - (width/4 - 1)
    start <- round(seq(0, startend, length.out = 4))
    stop <- round(seq(width/4, width, length.out = 4))
    if (length(c(start,stop)[anyDuplicated(c(start,stop))]) != 0) {
        dup <- anyDuplicated(c(start,stop))
        stop[which(stop == c(start,stop)[dup])] <- stop[which(stop == c(start,stop)[dup])] - 1
    }
    if (y == TRUE) {
        coord <- list(rep(start, each = 4), rep(stop, each = 4))
    } else if (y == FALSE) {
        coord <- list(rep(start, times = 4), rep(stop, times = 4))
    }
    return(coord)
}

sectorCalc <- function(x,y,DT) {
    sector <- numeric(length = 16)
    for (i in 1:length(sector)) {
        sect <- DT[X %between% c(x[[1]][i],x[[2]][i]) & Y %between% c(y[[1]][i],y[[2]][i])]
        sector[i] <- sCalc(sect)
    }
    return(sector)
}

x <- startstop(2000)
y <- startstop(1600, y = TRUE)

sectorLoop <- sectorCalc(x,y,DT)

sectorLoop returns:

-4.729271 -4.769156 -4.974996 -4.931120 -4.777013 -4.644919 -4.958968 -4.663221 -4.771545 -4.909868 -4.821098 -4.795526 -4.846709 -4.931514 -4.875148 -4.847105

One solution was to use the cut function:

DT[, x.sect := cut(DT[, X], seq(0, 2000, by = 500), dig.lab=10)]
DT[, y.sect := cut(DT[, Y], seq(0, 1600, by = 400), dig.lab=10)]
sectorRef <- DT[order(Norm), .(sCalc = sum(Norm[1:(0.02*.N)] / sqrt(Unif[1:(0.02*.N)])  )/(0.02*.N)), by = .(x.sect, y.sect)]
sectorRef <- sectorRef[[3]]
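
For reference, here is a minimal sketch of how cut assigns values to right-closed intervals, using a few hand-picked coordinates (illustrative only, not taken from DT). With breaks at seq(0, 2000, by = 500) the first interval is (0,500], whereas startstop(2000) appears to end its first sector at 499, so the two approaches may not place every boundary row in the same sector. The second set of breaks below is a hypothetical alternative, not something the original code uses.

```r
# Hand-picked coordinates (illustrative only)
xs <- c(1, 499, 500, 501, 2000)

# Right-closed bins of width 500; dig.lab = 10 keeps labels such as
# "(1000,1500]" out of scientific notation
sect <- cut(xs, breaks = seq(0, 2000, by = 500), dig.lab = 10)
print(data.frame(x = xs, sector = sect))

# Hypothetical alternative breaks that end the first bin at 499
sect499 <- cut(xs, breaks = c(0, 499, 1000, 1500, 2000), dig.lab = 10)
print(data.frame(x = xs, sector = sect499))
```

Under the first set of breaks, 500 falls in the first bin; under the second, it falls in the second bin.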

The cut-based solution gives the following values:

-4.919447 -4.778576 -4.757455 -4.779086 -4.739814 -4.836497 -4.776635 -4.656748 -4.939441 -4.707901 -4.751791 -4.864481 -4.839134 -4.973294 -4.663360 -5.055344

cor(sectorRef, sectorLoop)

The above returns: 0.0726904

As far as I understand the question, the first thing to explain is that .N tells you how many rows there are in each by = .(...) group. I think that is analogous to your nCells.

Where your cells takes the top 2% of rows in each group, the same can be done at the vector level by indexing with [1:(0.02*.N)]. Assuming you want the top 2% in order of increasing Norm (the order you would get from setkey(DT, Norm), although setting a key does more than just sort), you can either call setkey(DT, Norm) before the calculation, as in your example, or, to make it clearer what you are doing, use order(Norm) inside the calculation.
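
As a small illustration of these two points, the sketch below uses a made-up table (toy, with a hypothetical grouping column g) to show .N as the per-group row count and 1:(0.02 * .N) as a per-group index after sorting by Norm. It is only meant to show the mechanics, not the full sCalc calculation.

```r
library(data.table)

# Toy table (illustrative only): group "a" has 100 rows, group "b" has 50
set.seed(1)
toy <- data.table(g = rep(c("a", "b"), times = c(100, 50)), Norm = rnorm(150))

# .N is the number of rows in the current group
counts <- toy[, .(n = .N), by = g]

# Sort by Norm first (the i argument), then keep the first 2% within each group
low <- toy[order(Norm), .(Norm = Norm[1:(0.02 * .N)]), by = g]

print(counts)
print(low)
```

One caveat: for a group with fewer than 50 rows, 0.02 * .N is below 1 and 1:(0.02 * .N) still selects the first element. If such groups should contribute nothing, head(Norm, floor(0.02 * .N)) is one possible alternative.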

The sum() part doesn't change, so the equivalent third line is: sum()部分不变,因此等效的第三行是:

DT[order(Norm), 
   .(sCalc = sum( Norm[1:(0.02*.N)] / sqrt(Unif[1:(0.02*.N)]) )/.N), 
   by = .(x.sect, y.sect)]

This returns the result for each of the 16 groups:

         x.sect      y.sect       sCalc
 1: (1500,2000]  (800,1200] -0.09380209
 2:  (499,1000]   (399,800] -0.09833151
 3:  (499,1000] (1200,1600] -0.09606350
 4:     (0,499]   (399,800] -0.09623751
 5:     (0,499]  (800,1200] -0.09598717
 6: (1500,2000]     (0,399] -0.09306580
 7: (1000,1500]   (399,800] -0.09669593
 8: (1500,2000]   (399,800] -0.09606388
 9: (1500,2000] (1200,1600] -0.09368166
10:  (499,1000]     (0,399] -0.09611643
11: (1000,1500]     (0,399] -0.09404482
12:     (0,499] (1200,1600] -0.09387951
13: (1000,1500] (1200,1600] -0.10069461
14: (1000,1500]  (800,1200] -0.09825285
15:     (0,499]     (0,399] -0.09890184
16:  (499,1000]  (800,1200] -0.09756506
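
Note that with by the groups come back in order of first appearance, which after order(Norm) is effectively arbitrary, as the table above shows. This may be one reason a direct cor() against the loop output looks poor, since the two vectors are not in the same sector order. The sketch below checks, on a small made-up table (toy, with a hypothetical sector column), that the grouped expression matches sCalc applied sector by sector once keyby is used to sort the groups.

```r
library(data.table)

# Toy data (illustrative only): 4 sectors of 500 rows each
set.seed(42)
toy <- data.table(sector = rep(1:4, each = 500),
                  Norm = rnorm(2000), Unif = runif(2000))

# sCalc as defined in the question
sCalc <- function(DT) {
  setkey(DT, Norm)
  cells <- DT[1:(nrow(DT) * 0.02)]
  nCells <- nrow(DT)
  sum(cells[, Norm / sqrt(Unif)]) / nCells
}

# Loop version: one subset per sector, in sector order
loopRes <- sapply(1:4, function(s) sCalc(toy[sector == s]))

# Grouped version; keyby (rather than by) sorts the result by sector
grpRes <- toy[order(Norm),
              .(sCalc = sum(Norm[1:(0.02 * .N)] / sqrt(Unif[1:(0.02 * .N)])) / .N),
              keyby = sector]

print(all.equal(loopRes, grpRes$sCalc))
```

For the real data, keyby = .(y.sect, x.sect) would be the analogous choice if the loop fills its sectors with X varying fastest, which is how startstop appears to lay them out.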
