Can data.table handle identical column names when using .SDcols?
When using .SD to apply a function to a subset of dt's columns, I can't find the correct way to handle duplicated column names. For example:
# Make some data
library(data.table)
set.seed(123)
dt <- data.table( matrix( sample(6, 16, replace = TRUE) , 4 ) )
setnames(dt , rep( letters[1:2] , 2 ) )
# a b a b
#1: 2 6 4 5
#2: 5 1 3 4
#3: 3 4 6 1
#4: 6 6 3 6
# Use .SDcols to multiply both 'a' columns, specifying them by numeric position
dt[ , lapply( .SD , `*` , 2 ) , .SDcols = which( names(dt) %in% "a" ) ]
# a a
#1: 4 4
#2: 10 10
#3: 6 6
#4: 12 12
I couldn't get it to work when .SDcols was a character vector of column names, so I tried numeric positions ( which( names(dt) %in% "a" ) gives the vector [1] 1 3 ), but that also seems to multiply only the first a column. Am I doing something wrong?
.SDcols
Advanced. Specifies the columns of x included in .SD. May be character column names or numeric positions.
These also returned the same result as above:
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = which( names(dt) %in% "a" ) ]
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
packageVersion("data.table")
#[1] ‘1.8.11’
How about this:
dt[, "a"] * 2
## a a.1
## 1 4 8
## 2 10 6
## 3 6 12
## 4 12 6
For a more detailed discussion, see:
https://chat.stackoverflow.com/transcript/message/12783493#12783493
Consistent subset rules on data.tables with duplicate columns. In short, if indices are directly provided, 'j', or in .SDcols, then just those columns are either returned (or deleted if you provide -.SDcols or !j). If instead, column names are given and there is more than one occurrence of that column, then it's hard to decide which to keep and which to remove on a subset. Therefore, to remove, all occurrences of that column are removed, and to keep, always the first column is returned each time. Also closes #5688 and #5008.
Note that using by= to aggregate on duplicate columns may still not give the intended result, as it may not operate on the proper column.
Basically, if you do:
dt[, lapply(.SD, `*`, 2), .SDcols=c("a", "a")]
# a a
# 1: 4 4
# 2: 10 10
# 3: 6 6
# 4: 12 12
It'll still give the unintended result: it's hard to tell which "a" you mean each time, so the first one is always chosen.
But if you specify the positions explicitly (as you do in your question):
dt[, lapply(.SD, `*`, 2), .SDcols=which( names(dt) %in% "a" )]
# a a
# 1: 4 8
# 2: 10 6
# 3: 6 12
# 4: 12 6
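If renaming the columns is acceptable, one way to sidestep the ambiguity is to make the names unique first, so that a character .SDcols refers to exactly one column each. The following is only a sketch under that assumption. It uses base R's make.unique(), which is not part of the discussion above, and the variable names first_a, second_a, a_cols and res are made up for illustration.

```r
library(data.table)

# Same shape as the example data: four columns named a, b, a, b.
set.seed(123)
dt <- data.table(matrix(sample(6, 16, replace = TRUE), 4))
setnames(dt, rep(letters[1:2], 2))

# Keep copies of the two 'a' columns, taken by position, for comparison.
first_a  <- dt[[1]]
second_a <- dt[[3]]

# Assumption: renaming is acceptable. make.unique() appends .1, .2, ...
# to repeated names, so the columns become a, b, a.1, b.1.
setnames(dt, make.unique(names(dt)))

# With unique names, a character .SDcols is unambiguous.
a_cols <- c("a", "a.1")
res <- dt[, lapply(.SD, `*`, 2), .SDcols = a_cols]
print(res)
```

Here each column of res should be twice its own source column, rather than twice the first a column in both positions.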