
Can data.table handle identical column names when using .SDcols?

When using .SD to apply a function to a subset of dt's columns, I can't find the correct way to handle duplicated column names. For example:

#  Make some data
library(data.table)
set.seed(123)
dt <- data.table( matrix( sample(6,16,replace=TRUE) , 4 ) )
setnames(dt , rep( letters[1:2] , 2 ) )
#   a b a b
#1: 2 6 4 5
#2: 5 1 3 4
#3: 3 4 6 1
#4: 6 6 3 6

#  Use .SDcols to multiply both 'a' columns by 2, specifying them by numeric position
dt[ , lapply( .SD , `*`  , 2 ) , .SDcols = which( names(dt) %in% "a" ) ]
#    a  a
#1:  4  4
#2: 10 10
#3:  6  6
#4: 12 12

I couldn't get it to work when .SDcols was a character vector of column names, so I tried numeric positions ( which( names(dt) %in% "a" ) gives the vector [1] 1 3 ), but that also seems to multiply only the first a column. Am I doing something wrong?

.SDcols: Advanced. Specifies the columns of x included in .SD. May be character column names or numeric positions.

These also returned the same result as above:

dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = which( names(dt) %in% "a" ) ]
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]

packageVersion("data.table")
#[1] ‘1.8.11’

How about this:

dt[, "a"] * 2
##    a a.1
## 1  4   8
## 2 10   6
## 3  6  12
## 4 12   6

For a more detailed discussion, see:

https://chat.stackoverflow.com/transcript/message/12783493#12783493

This works as intended as of 1.9.4. From NEWS:

Consistent subset rules on data.tables with duplicate columns. In short, if indices are directly provided, 'j', or in .SDcols, then just those columns are either returned (or deleted if you provide -.SDcols or !j). If instead, column names are given and there are more than one occurrence of that column, then it's hard to decide which to keep and which to remove on a subset. Therefore, to remove, all occurrences of that column are removed, and to keep, always the first column is returned each time. Also closes #5688 and #5008. Note that using by= to aggregate on duplicate columns may not give intended result still, as it may not operate on the proper column.

Basically, if you do:

dt[, lapply(.SD, `*`, 2), .SDcols=c("a", "a")]
#     a  a
# 1:  4  4
# 2: 10 10
# 3:  6  6
# 4: 12 12

It will still give the unintended result: it's hard to tell which "a" you mean each time, so the first one is always chosen.

But if you specify the columns unambiguously by position (as you do in your question), each "a" column is handled separately:

dt[, lapply(.SD, `*`, 2), .SDcols=which( names(dt) %in% "a" )]
#     a  a
# 1:  4  8
# 2: 10  6
# 3:  6 12
# 4: 12  6
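
If selecting by name is preferred over positions, one possible workaround (not part of the NEWS entry or the answer above, so treat it as an assumption) is to make the column names unique first, so that each name refers to exactly one column. The sketch below builds the example table from explicit values instead of sample() so it does not depend on the random number generator. It uses base R's make.unique(), which renames the second "a" to "a.1".

```r
library(data.table)

# Same values as the example table, entered explicitly.
dt <- data.table(V1 = c(2, 5, 3, 6), V2 = c(6, 1, 4, 6),
                 V3 = c(4, 3, 6, 3), V4 = c(5, 4, 1, 6))
setnames(dt, rep(letters[1:2], 2))     # duplicated names: a b a b

# Workaround (assumption): rename in place so every column name is unique.
setnames(dt, make.unique(names(dt)))   # names become: a b a.1 b.1

# Select by name; each name now identifies a single column.
a_cols <- grep("^a", names(dt), value = TRUE)
res <- dt[, lapply(.SD, `*`, 2), .SDcols = a_cols]
print(res)
```

Here res holds the doubled values of both original "a" columns, under the names a and a.1. The trade-off is that the original duplicate names are lost, so this only suits cases where renaming is acceptable.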
