R data table: replace subset of row values across multiple columns using conditional with another column
This is my first post on Stack Overflow, so please forgive any mistakes. I'm also very new to R syntax and data tables.
Specifically for a data table, I want to conditionally test and replace row values across four columns by comparing them with the values in a fifth column. Example data is the following:
head(loadProfiles)
load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
1: 8469.231 2317.895 36700.00 220200.000 8808
2: 8768.000 2609.524 36533.33 36533.333 8768
3: 8744.000 3168.116 27325.00 10409.524 8744
4: 7006.452 3810.526 24133.33 3620.000 8688
5: 5794.595 4660.870 19490.91 2144.000 8576
6: 6057.143 5888.889 16307.69 2208.333 8480
7: 7036.667 7279.310 14073.33 2814.667 8444
8: 8107.692 8107.692 14053.33 3634.483 8432
9: 8138.462 9200.000 11755.56 3992.453 8464
10: 8173.077 10625.000 10119.05 4427.083 8500
What I would like to do is loop the following action over each of the first four columns, comparing each column to the values in the fifth column:
loadProfiles[load_ev_ag >= maxICA, load_ev_ag := maxICA]
The result I want should look like the following:
head(loadProfiles)
load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
1: 8469.231 2317.895 8808 8808 8808
2: 8768.000 2609.524 8768 8768 8768
3: 8744.000 3168.116 8744 8744 8744
4: 7006.452 3810.526 8688 3620.000 8688
5: 5794.595 4660.870 8576 2144.000 8576
6: 6057.143 5888.889 8480 2208.333 8480
7: 7036.667 7279.310 8444 2814.667 8444
8: 8107.692 8107.692 8432 3634.483 8432
9: 8138.462 8464 8464 3992.453 8464
10: 8173.077 8500 8500 4427.083 8500
I've tried the following with no luck:
loadProfileNames <- colnames(loadProfiles)[1:4]
loadProfiles[i = (loadProfileNames) >= maxICA,j = (loadProfileNames) := maxICA]
This produces the following warning and also sets all values in the first four columns equal to the values in the fifth column:
Warning message:
In (loadProfileNames) >= maxICA :
longer object length is not a multiple of shorter object length
I've also tried the following. It changes the subset of x rows that meet the criterion loadProfiles[[j]] >= maxICA to the first x entries in maxICA, rather than to the value in maxICA corresponding to each row i in that subset:
for(j in loadProfileNames) { set(loadProfiles,i=which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]),j=j,value=loadProfiles[["maxICA"]]) }
It also produces the following warnings:
Warning messages:
1: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]), :
Supplied 288 items to be assigned to 24 items of column 'load_ev_ag' (264 unused)
2: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]), :
Supplied 288 items to be assigned to 108 items of column 'load_ev_res' (180 unused)
3: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]), :
Supplied 288 items to be assigned to 156 items of column 'load_ev_res_tou' (132 unused)
4: In set(loadProfiles, i = which(loadProfiles[[j]] >= loadProfiles[["maxICA"]]), :
Supplied 288 items to be assigned to 156 items of column 'load_ev_workplace' (132 unused)
I'm pretty much stuck at this point. Any guidance would be much appreciated.
A more "data.table way" than using get() and eval() is the following, which also modifies loadProfiles by reference. It uses lapply(.SD, ...) together with .SDcols to identify the columns to operate on. pmin() is used instead of ifelse().
cols_to_change <- stringr::str_subset(names(loadProfiles), "^load_ev")
loadProfiles[, (cols_to_change) := lapply(.SD, function(x) pmin(x, maxICA)),
.SDcols = cols_to_change]
loadProfiles
# load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
# 1: 8469.231 2317.895 8808 8808.000 8808
# 2: 8768.000 2609.524 8768 8768.000 8768
# 3: 8744.000 3168.116 8744 8744.000 8744
# 4: 7006.452 3810.526 8688 3620.000 8688
# 5: 5794.595 4660.870 8576 2144.000 8576
# 6: 6057.143 5888.889 8480 2208.333 8480
# 7: 7036.667 7279.310 8444 2814.667 8444
# 8: 8107.692 8107.692 8432 3634.483 8432
# 9: 8138.462 8464.000 8464 3992.453 8464
#10: 8173.077 8500.000 8500 4427.083 8500
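To see why pmin() can stand in for the conditional replacement: it returns the element-wise (parallel) minimum of its arguments, which is what capping each value at maxICA amounts to. A minimal base R sketch, using made-up vectors x and cap rather than the OP's data:

```r
# Hypothetical toy vectors (not the OP's data)
x   <- c(8469.2, 36700.0, 9200.0)
cap <- c(8808, 8808, 8464)

# Element-wise minimum: each value is capped at the matching entry of cap
capped <- pmin(x, cap)

# The same result written as an explicit conditional
capped_ifelse <- ifelse(x >= cap, cap, x)

# Compare the two approaches
all.equal(capped, capped_ifelse)
```

Because pmin() is vectorised over whole columns, no row index is needed, which also sidesteps the length mismatch behind the set() warnings in the question.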
The above code could be rewritten to use the set() function:
for (j in cols_to_change) {
set(loadProfiles, ,j = j, value = pmin(loadProfiles[[j]], loadProfiles[["maxICA"]]))
}
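For comparison, the loop from the question can also be repaired while keeping its row index. The warnings arose because value was the full maxICA column (288 items) rather than only the entries at the matched rows. A minimal sketch on toy data, where the names dt, a, b and cap are hypothetical stand-ins for loadProfiles, the load_ev_* columns and maxICA:

```r
library(data.table)

# Toy data; a, b and cap are hypothetical stand-ins
dt <- data.table(a = c(1, 5, 9), b = c(7, 2, 8), cap = c(4, 6, 8))
cols <- c("a", "b")

for (j in cols) {
  # Rows where the column reaches or exceeds the cap
  idx <- which(dt[[j]] >= dt[["cap"]])
  # Subset value with the same index so the lengths match row for row
  set(dt, i = idx, j = j, value = dt[["cap"]][idx])
}

dt
```

The pmin() form above is shorter, but this variant only touches the rows that actually need changing.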
Inspired by Frank's comment, I was wondering which approach is best in terms of performance. For benchmarking, a data.table with 100000 rows is created by replicating the OP's data.
# create data.table with 100 000 rows
lp <- copy(loadProfiles0)
dummy <- lapply(1:4, function(x) lp <<-
rbindlist(list(lp, lp, lp, lp, lp, lp, lp, lp, lp, lp)))
nrow(lp)
#100000
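The same row count can be reached without the global assignment operator <<-. A sketch, assuming a 10-row starting table (here a hypothetical stand-in named base_dt), that uses rep() to build the list which rbindlist() then stacks:

```r
library(data.table)

# Hypothetical 10-row stand-in for loadProfiles0
base_dt <- data.table(id = 1:10, value = seq(0.5, 5, by = 0.5))

# Stack 10^4 copies: 10 rows x 10000 = 100000 rows
lp <- rbindlist(rep(list(base_dt), 1e4))

nrow(lp)
```

This is only a stylistic alternative; the resulting table has the same number of rows as the one built by the four rounds of tenfold replication.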
As all approaches modify loadProfiles in place, we need to take a copy before each run. The copy operation is also benchmarked for comparison.
microbenchmark::microbenchmark(
copy = loadProfiles <- copy(lp),
chris = {
loadProfiles <- copy(lp)
for (i in cols_to_change) {
loadProfiles[get(i) >= maxICA, eval(i) := as.double(maxICA)]
}
},
frank = {
loadProfiles <- copy(lp)
for (i in cols_to_change) {
loadProfiles[get(i) >= maxICA, (i) := as.double(maxICA)]
}
},
uwe = {
loadProfiles <- copy(lp)
loadProfiles[, (cols_to_change) := lapply(.SD, function(x) pmin(x, maxICA)),
.SDcols = cols_to_change]
},
set = {
loadProfiles <- copy(lp)
for (j in cols_to_change) {
set(loadProfiles, , j = j, value = pmin(loadProfiles[[j]], loadProfiles[["maxICA"]]))
}
}
)
Results:
#Unit: microseconds
# expr min lq mean median uq max neval
# copy 592.427 1007.012 1170.425 1111.224 1238.281 3977.826 100
# chris 8525.045 10614.394 12704.450 11499.447 12152.475 140577.520 100
# frank 4972.000 6799.118 8566.945 7339.060 7819.344 133202.589 100
# uwe 4201.354 6297.689 6711.409 6585.595 6914.846 10546.996 100
# set 3716.539 5580.662 7138.738 5907.836 6264.840 127311.557 100
Frank's suggestion to remove eval() from christoph's solution yields a remarkable speed increase (a median of about 7.3 ms versus 11.5 ms). However, the other two solutions are still faster, with set() slightly ahead (medians of about 5.9 ms for set() and 6.6 ms for the lapply(.SD, ...) version).
Data (loadProfiles0, the starting point of the benchmark above):
loadProfiles0 <- fread("load_ev_ag load_ev_res load_ev_res_tou load_ev_workplace maxICA
8469.231 2317.895 36700.00 220200.000 8808
8768.000 2609.524 36533.33 36533.333 8768
8744.000 3168.116 27325.00 10409.524 8744
7006.452 3810.526 24133.33 3620.000 8688
5794.595 4660.870 19490.91 2144.000 8576
6057.143 5888.889 16307.69 2208.333 8480
7036.667 7279.310 14073.33 2814.667 8444
8107.692 8107.692 14053.33 3634.483 8432
8138.462 9200.000 11755.56 3992.453 8464
8173.077 10625.000 10119.05 4427.083 8500")
Your first attempt was almost right:
profilenames <- names(loadProfiles)[1:4]
for (i in profilenames) {
loadProfiles[get(i) >= maxICA, eval(i) := as.double(maxICA)]
}
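The reason get() is needed: the loop variable holds a column name as a string, so comparing it directly against maxICA compares text with numbers, which is what happened with (loadProfileNames) >= maxICA in the question. get() looks the column up by name instead. A small sketch on toy data, where the names dt, a, b and cap are hypothetical:

```r
library(data.table)

# Toy data; a, b and cap are hypothetical stand-ins
dt <- data.table(a = c(1, 5, 9), b = c(7, 2, 8), cap = c(4, 6, 8))
cols <- c("a", "b")

for (col in cols) {
  # get(col) returns the column named by the string in col;
  # (col) on the left of := uses the value of col as the target column name
  dt[get(col) >= cap, (col) := cap]
}

dt
```

Wrapping the name in parentheses, (col), is the form Frank suggested in place of eval(col).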
You could also solve this with lapply and ifelse, which is valid even for data.frames:
loadProfiles[loadProfileNames] <- lapply(loadProfiles[loadProfileNames],
function (i) ifelse (i >= loadProfiles$maxICA, loadProfiles$maxICA, i))
And for data.tables, the .SD variable is a good resource:
# Returns a new data.table with only the selected columns; loadProfiles itself is unchanged
loadProfiles[, lapply(.SD, function(i) ifelse(i >= maxICA, maxICA, i)), .SDcols = loadProfileNames]