R data.table: replace a subset of row values across multiple columns using a condition on another column

Question

This is my first post on Stack Overflow, so forgive any mistakes. I'm also very new to R syntax and data tables.

Specifically for a data table, I want to conditionally test and replace row values across four columns by comparing them with the values in a fifth column. Example data is the following:

head(loadProfiles)
    load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
 1:   8469.231    2317.895        36700.00        220200.000   8808
 2:   8768.000    2609.524        36533.33         36533.333   8768
 3:   8744.000    3168.116        27325.00         10409.524   8744
 4:   7006.452    3810.526        24133.33          3620.000   8688
 5:   5794.595    4660.870        19490.91          2144.000   8576
 6:   6057.143    5888.889        16307.69          2208.333   8480
 7:   7036.667    7279.310        14073.33          2814.667   8444
 8:   8107.692    8107.692        14053.33          3634.483   8432
 9:   8138.462    9200.000        11755.56          3992.453   8464
10:   8173.077   10625.000        10119.05          4427.083   8500

What I would like to do is loop the following action over each of the first 4 columns, comparing each column to the values in the fifth column.

loadProfiles[load_ev_ag >= maxICA, load_ev_ag := maxICA]

The result I want should look like the following:

head(loadProfiles)
    load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
 1:   8469.231    2317.895            8808              8808   8808
 2:   8768.000    2609.524            8768              8768   8768
 3:   8744.000    3168.116            8744              8744   8744
 4:   7006.452    3810.526            8688          3620.000   8688
 5:   5794.595    4660.870            8576          2144.000   8576
 6:   6057.143    5888.889            8480          2208.333   8480
 7:   7036.667    7279.310            8444          2814.667   8444
 8:   8107.692    8107.692            8432          3634.483   8432
 9:   8138.462        8464            8464          3992.453   8464
10:   8173.077        8500            8500          4427.083   8500

I've tried the following with no luck:

loadProfileNames <- colnames(loadProfiles)[1:4]
loadProfiles[i = (loadProfileNames) >= maxICA,j = (loadProfileNames) := maxICA]

This produces the following warning and also sets all values in the first four columns equal to the values in the fifth column:

Warning message:
In (loadProfileNames) >= maxICA :
  longer object length is not a multiple of shorter object length

I've also tried the following. For each column, it changes the subset of x rows that meet the condition (column value >= maxICA) to the first x entries of maxICA, rather than to the maxICA value in the same row as each matching entry:

for(j in loadProfileNames) { set(loadProfiles,i=which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),j=j,value=loadProfiles[["maxICA"]]) }

and produces the following warnings:

Warning messages:
1: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),  :
  Supplied 288 items to be assigned to 24 items of column 'load_ev_ag' (264 unused)
2: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),  :
  Supplied 288 items to be assigned to 108 items of column 'load_ev_res' (180 unused)
3: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),  :
  Supplied 288 items to be assigned to 156 items of column 'load_ev_res_tou' (132 unused)
4: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),  :
  Supplied 288 items to be assigned to 156 items of column 'load_ev_workplace' (132 unused)

I'm pretty much stuck at this point. Any guidance would be much appreciated.

Answer 1

A more "data.table way" than using get() and eval() is to modify loadProfiles by reference with lapply(.SD, ...), using .SDcols to identify the columns to operate on. pmin() is used instead of ifelse().

    cols_to_change <- stringr::str_subset(names(loadProfiles), "^load_ev")
    loadProfiles[, (cols_to_change) := lapply(.SD, function(x) pmin(x, maxICA)),
                 .SDcols = cols_to_change]
    loadProfiles
#    load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
# 1:   8469.231    2317.895            8808          8808.000   8808
# 2:   8768.000    2609.524            8768          8768.000   8768
# 3:   8744.000    3168.116            8744          8744.000   8744
# 4:   7006.452    3810.526            8688          3620.000   8688
# 5:   5794.595    4660.870            8576          2144.000   8576
# 6:   6057.143    5888.889            8480          2208.333   8480
# 7:   7036.667    7279.310            8444          2814.667   8444
# 8:   8107.692    8107.692            8432          3634.483   8432
# 9:   8138.462    8464.000            8464          3992.453   8464
#10:   8173.077    8500.000            8500          4427.083   8500
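
To see why pmin() is enough here, a minimal sketch on a toy data.table may help. The names dt, a, b and cap are made up for illustration and are not from the question. pmin() takes the element-wise minimum of its arguments, so any value above the cap in the same row is replaced by that cap and everything else is left alone.

```r
library(data.table)

# Toy data; dt, a, b and cap are hypothetical names, not from the question
dt <- data.table(a = c(1, 5, 9), b = c(10, 2, 3), cap = c(4, 4, 4))

# Element-wise minimum of each target column and the cap column,
# assigned back by reference
cols <- c("a", "b")
dt[, (cols) := lapply(.SD, function(x) pmin(x, cap)), .SDcols = cols]

dt$a  # 1 4 4
dt$b  # 4 2 3
```

Because the comparison is row by row, no explicit row filter in i is needed.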

The above code could be rewritten to use the set() function:

for (j in cols_to_change) { 
  set(loadProfiles, ,j = j, value = pmin(loadProfiles[[j]], loadProfiles[["maxICA"]])) 
}

Benchmark

Inspired by Frank's comment, I was wondering which approach performs best. For benchmarking, a data.table with 100,000 rows is created by replicating the OP's data.

# create data.table with 100 000 rows
lp <- copy(loadProfiles0)
dummy <- lapply(1:4, function(x) lp <<- 
                  rbindlist(list(lp, lp, lp, lp, lp, lp, lp, lp, lp, lp)))
nrow(lp)
#100000

As all approaches modify loadProfiles in place, we need to take a copy before each run. The copy operation is also benchmarked for comparison.

microbenchmark::microbenchmark(
  copy = loadProfiles <- copy(lp),
  chris = {
    loadProfiles <- copy(lp)
    for (i in cols_to_change) { 
      loadProfiles[get(i) >= maxICA, eval(i) := as.double(maxICA)]
    }
  },
  frank = {
    loadProfiles <- copy(lp)
    for (i in cols_to_change) { 
      loadProfiles[get(i) >= maxICA, (i) := as.double(maxICA)]
    }
  },
  uwe = {
    loadProfiles <- copy(lp)
    loadProfiles[, (cols_to_change) := lapply(.SD, function(x) pmin(x, maxICA)),
                 .SDcols = cols_to_change]
  },
  set = {
    loadProfiles <- copy(lp)
    for (j in cols_to_change) { 
      set(loadProfiles, , j = j, value = pmin(loadProfiles[[j]], loadProfiles[["maxICA"]])) 
    }
  }
)

Results:

#Unit: microseconds
#  expr      min        lq      mean    median        uq        max neval
#  copy  592.427  1007.012  1170.425  1111.224  1238.281   3977.826   100
# chris 8525.045 10614.394 12704.450 11499.447 12152.475 140577.520   100
# frank 4972.000  6799.118  8566.945  7339.060  7819.344 133202.589   100
#   uwe 4201.354  6297.689  6711.409  6585.595  6914.846  10546.996   100
#   set 3716.539  5580.662  7138.738  5907.836  6264.840 127311.557   100

Frank's suggestion to remove eval() from christoph's solution yields a remarkable speed increase. However, the other two solutions are still faster by median time, with set slightly ahead.

Data

loadProfiles0 <- fread("load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
         8469.231    2317.895        36700.00        220200.000   8808
         8768.000    2609.524        36533.33         36533.333   8768
         8744.000    3168.116        27325.00         10409.524   8744
         7006.452    3810.526        24133.33          3620.000   8688
         5794.595    4660.870        19490.91          2144.000   8576
         6057.143    5888.889        16307.69          2208.333   8480
         7036.667    7279.310        14073.33          2814.667   8444
         8107.692    8107.692        14053.33          3634.483   8432
         8138.462    9200.000        11755.56          3992.453   8464
         8173.077   10625.000        10119.05          4427.083   8500")

Answer 2

Your first attempt was almost right:

profilenames <- names(loadProfiles)[1:4]
for (i in profilenames) { 
  loadProfiles[get(i) >= maxICA, eval(i) := as.double(maxICA)]
}
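
The set() loop from the question can be repaired along similar lines. This is a minimal sketch, assuming the same column layout as the question; the toy values and the helper name idx are made up for illustration. The warnings arose because value was the full maxICA column while i selected only some rows. Subsetting maxICA with the same row index makes the lengths match.

```r
library(data.table)

# Toy stand-in for loadProfiles; the values are made up
loadProfiles <- data.table(load_ev_ag  = c(9000, 8000, 7000),
                           load_ev_res = c(1000, 9500, 8600),
                           maxICA      = c(8808, 8768, 8744))

for (j in c("load_ev_ag", "load_ev_res")) {
  # Rows where this column meets or exceeds the cap
  idx <- which(loadProfiles[[j]] >= loadProfiles[["maxICA"]])
  # Pass only the matching maxICA values, so value lines up with i
  set(loadProfiles, i = idx, j = j, value = loadProfiles[["maxICA"]][idx])
}

loadProfiles$load_ev_ag   # 8808 8000 7000
loadProfiles$load_ev_res  # 1000 8768 8600
```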

Answer 3

You could also solve this with lapply and ifelse, an approach that also works for plain data.frames:

loadProfiles[loadProfileNames] <- lapply(loadProfiles[loadProfileNames],
  function (i) ifelse (i >= loadProfiles$maxICA, loadProfiles$maxICA, i))

And for data.tables, the .SD variable is a good resource. Note that, as written, this returns the capped columns as a new data.table rather than modifying loadProfiles:

loadProfiles[, lapply(.SD, function(i) ifelse(i >= maxICA, maxICA, i)), .SDcols = loadProfileNames]
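
For what it's worth, the ifelse() form used here and the pmin() form from Answer 1 should give the same values when there are no missing values. A small sketch with made-up vectors (x and cap are hypothetical names) to check that:

```r
# Made-up values; x plays the role of one load column, cap the role of maxICA
x   <- c(1, 5, 9)
cap <- c(4, 5, 6)

capped_ifelse <- ifelse(x >= cap, cap, x)  # replace where x meets or exceeds cap
capped_pmin   <- pmin(x, cap)              # element-wise minimum

identical(capped_ifelse, capped_pmin)  # TRUE
```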

3 answers

Answer 1: score 3, posted 2017-03-22 21:30:40
Answer 2: score 1, accepted, posted 2017-03-22 20:44:41
Answer 3: score 0, posted 2017-03-22 21:16:50
