简体   繁体   中英

Weighted difference calculation on a data.table

I need your help in writing a efficient function to calculate a weighted difference. Preferrably with data.table as my dataset has a million records.

Here is a small df with two groups 1 and 2. It is ordered by layer from 1 to maximum 5 (it could also be less than 5 layers, eg group = 2 has only 3)

group <- c(1, 1, 1, 1, 1, 2, 2, 2)
layer <- c(1, 2, 3, 4, 5, 1, 2, 3)
value <- c(10.62, 10.61, 10.60, 10.40, 10.34, 15.40, 15.39, 15.35)
reference_value <- c(10.90, 10.90, 10.90, 10.90, 10.90, 15.70, 15.70, 15.70)
size <- c(30, 120, 250, 40, 20, 70, 80, 90)
cumsum_size <- c(30, 150, 400, 440, 460, 70, 150, 240)

df <- data.frame(group, layer, value, reference_value, size, cumsum_size)

I need to calculate a weighted difference to reference_value column for different target sizes for every group. Say, I have the following

target_sizes <- c(20, 50, 200, 500)

To calculate the weighted diff the target_size needs to be compared to the cumsum_size to decide at which layer it will be, so, for df$group == 1 the calculation should be:

Case 1
target size 20

# it stays on layer 1 as 20 < 30
  weighted impact is  

  abs(10.62-10.90)*20

Case 2
# target size 50
it takes 30 from layer 1 and additional 20 from layer 2

weighted impact is 
  abs(10.62 - 10.90)*30 + abs(10.61 - 10.90)*20


Case 3
  # target size 200  
it takes 30 from layer 1, 120 from layer 2 and 50 from layer 3  

weighted impact is 
abs(10.62 - 10.90)*30 + abs(10.61 - 10.90)*120 + abs(10.60 - 10.90)*50


Case 4
# target size 500  

as the cumsum_size in the last layer 5 (460) less than 500
it should return NA

Here is one possible approach.

#since abs diff never changes, calculate it first
DT[, absDiff := abs(value - reference_value)]

#merge the unique set of group & cumsum_size with DT
dat <- merge(DT, DT[,CJ(isTarg=TRUE, group=unique(group), cumsum_size=target_sizes)], 
    by=c("group", "cumsum_size"), all=TRUE)

#backfill the absDiff using zoo::na.locf
dat[, absDiff := na.locf(absDiff, na.rm=FALSE, fromLast=TRUE), by=.(group)]

#calculate the desired results by group and filter for chosen target_sizes
dat[, result := cumsum(diff(c(0, cumsum_size)) * absDiff), by=.(group)][(isTarg)]

output:

   group cumsum_size layer value reference_value size absDiff isTarg result
1:     1          20    NA    NA              NA   NA    0.28   TRUE    5.6
2:     1          50    NA    NA              NA   NA    0.29   TRUE   14.2
3:     1         200    NA    NA              NA   NA    0.30   TRUE   58.2
4:     1         500    NA    NA              NA   NA      NA   TRUE     NA
5:     2          20    NA    NA              NA   NA    0.30   TRUE    6.0
6:     2          50    NA    NA              NA   NA    0.30   TRUE   15.0
7:     2         200    NA    NA              NA   NA    0.35   TRUE   63.3
8:     2         500    NA    NA              NA   NA      NA   TRUE     NA

data and packages:

library(data.table)
library(zoo)
group <- c(1, 1, 1, 1, 1, 2, 2, 2)
layer <- c(1, 2, 3, 4, 5, 1, 2, 3)
value <- c(10.62, 10.61, 10.60, 10.40, 10.34, 15.40, 15.39, 15.35)
reference_value <- c(10.90, 10.90, 10.90, 10.90, 10.90, 15.70, 15.70, 15.70)
size <- c(30, 120, 250, 40, 20, 70, 80, 90)
cumsum_size <- c(30, 150, 400, 440, 460, 70, 150, 240)

DT <- data.table(group, layer, value, reference_value, size, cumsum_size)
target_sizes <- c(20, 50, 200, 500)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM