
R data.table: Setting the remainder of column values to the next column value if exceeding a certain threshold, for a large data set

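I am working on a simple peak-shaving algorithm and am looking for the most optimized way to carry the remainder of a column value over to the next column whenever the value exceeds a certain threshold, for a large time series.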

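Consider the following example data set, with a threshold defined for each time column. The goal is a data.table in which every value is capped at its column's threshold and the excess is added to the next column's value (which is in turn capped at its own threshold), and so on up to some window limit.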

library(data.table)

loads <- data.table(index = 1:3,
                    time1 = c(6600, 3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000, 12000, 0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0))

thresholds <- c("time1" = 5000, 
                "time2" = 5000,
                "time3" = 5000,
                "time4" = 12000,
                "time5" = 12000,
                "time6" = 12000,
                "time7" = 5000)

For a window of 7 columns, this should result in the following data.table:

res <- data.table(index = 1:3,
                  time1 = c(5000, 3000, 5000),
                  time2 = c(5000, 4000, 5000),
                  time3 = c(5000, 0, 4000),
                  time4 = c(6600, 12000, 0),
                  time5 = c(5000, 2000, 3000),
                  time6 = c(0, 0, 0),
                  time7 = c(5000, 0, 0))

I know there are obvious ways to do this row by row, but I am looking for a more vectorized / data.table way of doing it.
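For reference, the row-by-row approach mentioned above can be sketched in base R as follows. The helper name `cap_and_carry` is hypothetical (it is not part of data.table or of the original post), and the sketch assumes that any excess left over after the last column is simply dropped, as in the expected result.

```r
# Hypothetical helper: cap one row of loads at its thresholds and
# carry any excess forward to the next column.
cap_and_carry <- function(x, thr) {
  stopifnot(length(x) == length(thr))
  carry <- 0
  for (j in seq_along(x)) {
    total <- x[j] + carry        # own load plus what spilled over
    x[j]  <- min(total, thr[j])  # cap at this column's threshold
    carry <- total - x[j]        # excess moves on to the next column
  }
  x                              # excess past the last column is dropped
}

thr  <- c(5000, 5000, 5000, 12000, 12000, 12000, 5000)
row1 <- c(6600, 12000, 0, 3000, 5000, 0, 15000)
row3 <- c(12000, 2000, 0, 0, 3000, 0, 0)

cap_and_carry(row1, thr)  # 5000 5000 5000 6600 5000 0 5000
cap_and_carry(row3, thr)  # 5000 5000 4000 0 3000 0 0
```

Applying this to every row gives the expected table, but it loops over rows, which is the part that becomes slow for a large data set.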

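I don't think it's easy (or even possible?) to "just" vectorize this or write it as canonical data.table code, but here is a straightforward for loop that is (I think) about as data.table-like as is reasonable. It loops over the columns rather than the rows, so each step is vectorized across all rows.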

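timeX: I'm adding timeX to thresholds (with a limit of Inf) and to loads (with a value of 0) as a catch-all column, so that we know how much of each row's remainder is "lost". It is also convenient for the for loop (though the loop can be written without it, with some rewriting).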

library(data.table)
thresholds <- c("time1" = 5000, 
                "time2" = 5000,
                "time3" = 5000,
                "time4" = 12000,
                "time5" = 12000,
                "time6" = 12000,
                "time7" = 5000,
                "timeX" = Inf)
loads[, timeX := 0 ]

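for (ind in seq_along(thresholds)) {
  # the last entry (timeX) only receives overflow, so stop before it
  if (ind >= length(thresholds)) break
  nm <- names(thresholds)[ind]       # current column
  nm1 <- names(thresholds)[ind+1]    # next column, receives the excess
  # amount by which each row exceeds the current column's threshold
  rmndr <- pmax(0, loads[[nm]] - thresholds[ind])
  # cap the current column, then carry the excess into the next one
  set(loads, i = NULL, j = nm, value = pmin(loads[[nm]], thresholds[ind]))
  set(loads, i = NULL, j = nm1, value = loads[[nm1]] + rmndr)
}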
loads
#    index time1 time2 time3 time4 time5 time6 time7 timeX
#    <int> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:     1  5000  5000  5000  6600  5000     0  5000 10000
# 2:     2  3000  4000     0 12000  2000     0     0     0
# 3:     3  5000  5000  4000     0  3000     0     0     0
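One way to sanity-check this output: because timeX absorbs everything that spills past time7, each row's total should be unchanged by the loop. A minimal base-R check, with the values copied by hand from the input and output tables above (the matrix names `before` and `after` are only illustrative):

```r
# Rows of `loads` before the loop (time1..time7)
before <- rbind(c(6600, 12000, 0,  3000, 5000, 0, 15000),
                c(3000,  4000, 0, 12000, 2000, 0,     0),
                c(12000, 2000, 0,     0, 3000, 0,     0))

# Rows of `loads` after the loop (time1..time7, timeX)
after <- rbind(c(5000, 5000, 5000,  6600, 5000, 0, 5000, 10000),
               c(3000, 4000,    0, 12000, 2000, 0,    0,     0),
               c(5000, 5000, 4000,     0, 3000, 0,    0,     0))

# Capping only moves load to later columns, so row totals are conserved
stopifnot(isTRUE(all.equal(rowSums(before), rowSums(after))))
rowSums(before)  # 41600 21000 17000
```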

Or, if you really don't care about the discarded amounts, then:

## using unmodified `loads` and `thresholds`
for (ind in seq_along(thresholds)) {
  nm <- names(thresholds)[ind]
  rmndr <- pmax(0, loads[[nm]] - thresholds[nm])
  set(loads, i = NULL, j = nm, value = pmin(loads[[nm]], thresholds[nm]))
  # nothing to carry into after the last column, so the excess is dropped
  if (ind == length(thresholds)) break
  nm1 <- names(thresholds)[ind+1]
  set(loads, i = NULL, j = nm1, value = loads[[nm1]] + rmndr)
}
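To compare this against the expected table from the question, the loop can be wrapped in a function that works on a copy, so the input is not modified by reference. This is only a sketch: the function name `shave_peaks` and the variable names `capped` and `expected` are hypothetical, and it assumes every name in `thresholds` is a column of the table, in carry-over order.

```r
library(data.table)

# Hypothetical wrapper around the column-wise loop; returns a new table.
shave_peaks <- function(dt, thresholds) {
  out  <- copy(dt)                 # set() modifies by reference, so copy first
  cols <- names(thresholds)
  for (ind in seq_along(cols)) {
    nm    <- cols[ind]
    rmndr <- pmax(0, out[[nm]] - thresholds[[nm]])   # excess per row
    set(out, j = nm, value = pmin(out[[nm]], thresholds[[nm]]))
    if (ind == length(cols)) break                   # last column: drop excess
    nm1 <- cols[ind + 1]
    set(out, j = nm1, value = out[[nm1]] + rmndr)    # carry excess forward
  }
  out
}

loads <- data.table(index = 1:3,
                    time1 = c(6600, 3000, 12000),
                    time2 = c(12000, 4000, 2000),
                    time3 = c(0, 0, 0),
                    time4 = c(3000, 12000, 0),
                    time5 = c(5000, 2000, 3000),
                    time6 = c(0, 0, 0),
                    time7 = c(15000, 0, 0))
thresholds <- c(time1 = 5000, time2 = 5000, time3 = 5000, time4 = 12000,
                time5 = 12000, time6 = 12000, time7 = 5000)

expected <- data.table(index = 1:3,
                       time1 = c(5000, 3000, 5000),
                       time2 = c(5000, 4000, 5000),
                       time3 = c(5000, 0, 4000),
                       time4 = c(6600, 12000, 0),
                       time5 = c(5000, 2000, 3000),
                       time6 = c(0, 0, 0),
                       time7 = c(5000, 0, 0))

capped <- shave_peaks(loads, thresholds)
all.equal(capped, expected)
```

The loop runs once per column rather than once per row, so its cost grows with the window width while each step stays vectorized over all rows.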
