
R data.table: carrying the excess of a column value over to the next column when it exceeds a threshold, for a large data set

I am working on a simple peak shaving algorithm and am looking for the most efficient way to cap each column value at a threshold and carry the excess over to the next column, for a large time series.

Consider the following example dataset, with a threshold set for each time column. The goal is a data.table in which every value is capped at its column's threshold and the remainder is added to the next column's value (which in turn must not exceed its own threshold), and so on up to a certain window limit.

loads <- data.table(index = 1:3,
                    time1 = c(6600,3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000,12000,0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0))

thresholds <- c("time1" = 5000, 
                "time2" = 5000,
                "time3" = 5000,
                "time4" = 12000,
                "time5" = 12000,
                "time6" = 12000,
                "time7" = 5000)

With a window of 7 columns this should result in the following data.table:

res <- data.table(index = 1:3,
                  time1 = c(5000, 3000, 5000),
                  time2 = c(5000, 4000, 5000),
                  time3 = c(5000, 0, 4000),
                  time4 = c(6600, 12000, 0),
                  time5 = c(5000, 2000, 3000),
                  time6 = c(0, 0, 0),
                  time7 = c(5000, 0, 0))

I know there are some obvious ways to do this row-wise, but I am looking for a more vectorized/data.table approach to do this.
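For reference, the row-wise rule described above might be sketched in base R as follows. The helper name `carry_forward` and its arguments are made up for illustration, and whatever is still left over after the last column is simply dropped:

```r
# Hypothetical helper: cap one row at its per-column limits and carry
# the excess forward to the next column.
carry_forward <- function(x, limits) {
  carry <- 0
  for (k in seq_along(x)) {
    total <- x[k] + carry            # own load plus excess from the left
    x[k] <- min(total, limits[k])    # cap at this column's threshold
    carry <- total - x[k]            # excess handed to the next column
  }
  x  # any excess remaining after the last column is discarded
}

limits <- c(5000, 5000, 5000, 12000, 12000, 12000, 5000)
row1 <- c(6600, 12000, 0, 3000, 5000, 0, 15000)
shaved_row1 <- carry_forward(row1, limits)
# shaved_row1 matches the first row of `res`:
# 5000 5000 5000 6600 5000 0 5000
```

Applying this to every row (for example with `apply`) gives the expected table, but it loops over rows, which is what I would like to avoid.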

I don't think this is easy (or even possible?) with "just" vectorized/data.table-canonical code, but here's a straightforward for loop over the columns that does it about as data.table-efficiently (I think) as is reasonable.

Up front: I add timeX to both thresholds (with a limit of Inf) and loads (with a value of 0) as a catch-all column, so we know how much of each row's remainder has been "lost". It's also handy to have for the for loop (though it can be done without it, with some rewriting).

library(data.table)
thresholds <- c("time1" = 5000, 
                "time2" = 5000,
                "time3" = 5000,
                "time4" = 12000,
                "time5" = 12000,
                "time6" = 12000,
                "time7" = 5000,
                "timeX" = Inf)
loads[, timeX := 0 ]

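for (ind in seq_along(thresholds)) {
  # the last column (timeX) only collects overflow; nothing to push further
  if (ind >= length(thresholds)) break
  nm <- names(thresholds)[ind]       # current column
  nm1 <- names(thresholds)[ind+1]    # column that receives the excess
  # per-row amount above the current column's threshold
  rmndr <- pmax(0, loads[[nm]] - thresholds[ind])
  # cap the current column, then add the excess to the next one (both by reference)
  set(loads, i = NULL, j = nm, value = pmin(loads[[nm]], thresholds[ind]))
  set(loads, i = NULL, j = nm1, value = loads[[nm1]] + rmndr)
}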
loads
#    index time1 time2 time3 time4 time5 time6 time7 timeX
#    <int> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:     1  5000  5000  5000  6600  5000     0  5000 10000
# 2:     2  3000  4000     0 12000  2000     0     0     0
# 3:     3  5000  5000  4000     0  3000     0     0     0
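One way to sanity-check this result is to confirm that nothing is created or lost: with the timeX catch-all included, each row's total should be the same before and after. The sketch below repeats the data and the loop so it runs on its own; the names `before`, `after`, `time_cols` and `within_limits` are only illustrative.

```r
library(data.table)

loads <- data.table(index = 1:3,
                    time1 = c(6600, 3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000, 12000, 0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0))
thresholds <- c(time1 = 5000, time2 = 5000, time3 = 5000,
                time4 = 12000, time5 = 12000, time6 = 12000,
                time7 = 5000, timeX = Inf)

loads[, timeX := 0]
time_cols <- names(thresholds)

# row totals before shaving (timeX is still 0 here)
before <- rowSums(loads[, time_cols, with = FALSE])

# same column-by-column loop as above, stopping before the catch-all column
for (ind in seq_len(length(time_cols) - 1L)) {
  nm <- time_cols[ind]
  nm1 <- time_cols[ind + 1L]
  rmndr <- pmax(0, loads[[nm]] - thresholds[[nm]])
  set(loads, j = nm, value = pmin(loads[[nm]], thresholds[[nm]]))
  set(loads, j = nm1, value = loads[[nm1]] + rmndr)
}

# row totals after shaving, including what ended up in timeX
after <- rowSums(loads[, time_cols, with = FALSE])

# every column should respect its threshold
within_limits <- vapply(time_cols,
                        function(nm) all(loads[[nm]] <= thresholds[[nm]]),
                        logical(1))

stopifnot(isTRUE(all.equal(before, after)), all(within_limits))
```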

Or if you really don't care about the discarded numbers, then

## using unmodified `loads` and `thresholds`
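for (ind in seq_along(thresholds)) {
  nm <- names(thresholds)[ind]
  # per-row amount above the current column's threshold
  rmndr <- pmax(0, loads[[nm]] - thresholds[nm])
  # cap the current column by reference
  set(loads, i = NULL, j = nm, value = pmin(loads[[nm]], thresholds[nm]))
  # the last column is capped too, but its excess is simply discarded
  if (ind == length(thresholds)) break
  nm1 <- names(thresholds)[ind+1]
  set(loads, i = NULL, j = nm1, value = loads[[nm1]] + rmndr)
}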
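Both loops use `set()`, which modifies `loads` by reference, so the original values are overwritten. If the input should be kept, one option is to wrap the loop in a function that works on a `copy()`. This is only a sketch: `shave_peaks` is a made-up name, and it assumes `thresholds` is a named vector whose names are the columns in the order the excess should flow.

```r
library(data.table)

# Hypothetical wrapper around the loop above; returns a new table.
shave_peaks <- function(dt, thresholds) {
  out <- copy(dt)                       # leave the caller's table untouched
  cols <- names(thresholds)
  for (ind in seq_along(cols)) {
    nm <- cols[ind]
    # per-row amount above this column's threshold
    rmndr <- pmax(0, out[[nm]] - thresholds[[nm]])
    set(out, j = nm, value = pmin(out[[nm]], thresholds[[nm]]))
    if (ind < length(cols)) {           # excess of the last column is dropped
      nm1 <- cols[ind + 1L]
      set(out, j = nm1, value = out[[nm1]] + rmndr)
    }
  }
  out
}

loads <- data.table(index = 1:3,
                    time1 = c(6600, 3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000, 12000, 0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0))
thresholds <- c(time1 = 5000, time2 = 5000, time3 = 5000,
                time4 = 12000, time5 = 12000, time6 = 12000,
                time7 = 5000)

shaved <- shave_peaks(loads, thresholds)  # `loads` itself is unchanged
```

The loop still runs once per column rather than once per row, so the cost grows with the window width, not with the number of rows.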
