Can data.table handle identical column names when using .SDcols?

Question

When using .SD to apply a function to a subset of dt's columns, I can't seem to find the correct way to handle duplicated column names. For example:

library(data.table)

#  Make some data
set.seed(123)
dt <- data.table( matrix( sample(6, 16, replace = TRUE) , 4 ) )
#  Give the four columns duplicated names: a b a b
setnames(dt , rep( letters[1:2] , 2 ) )
#   a b a b
#1: 2 6 4 5
#2: 5 1 3 4
#3: 3 4 6 1
#4: 6 6 3 6

#  Use .SDcols to multiply both 'a' columns by 2, specifying them by numeric position
dt[ , lapply( .SD , `*`  , 2 ) , .SDcols = which( names(dt) %in% "a" ) ]
#    a  a
#1:  4  4
#2: 10 10
#3:  6  6
#4: 12 12

I couldn't get it to work when .SDcols was a character vector of column names, so I tried numeric positions ( which( names(dt) %in% "a" ) gives the vector [1] 1 3 ), but that also seems to multiply only the first a column. Am I doing something wrong?

.SDcols: Advanced. Specifies the columns of x included in .SD. May be character column names or numeric positions.

These also returned the same result as above:

dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = which( names(dt) %in% "a" ) ]
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]

packageVersion("data.table")
#[1] ‘1.8.11’

Answer 1

How about this:

dt[, "a"] * 2
##    a a.1
## 1  4   8
## 2 10   6
## 3  6  12
## 4 12   6

For a more detailed discussion, see:

https://chat.stackoverflow.com/transcript/message/12783493#12783493

Answer 2

This works as intended since 1.9.4. From NEWS:

Consistent subset rules on data.tables with duplicate columns. In short, if indices are directly provided, 'j', or in .SDcols, then just those columns are either returned (or deleted if you provide -.SDcols or !j). If instead, column names are given and there are more than one occurrence of that column, then it's hard to decide which to keep and which to remove on a subset. Therefore, to remove, all occurrences of that column are removed, and to keep, always the first column is returned each time. Also closes #5688 and #5008. Note that using by= to aggregate on duplicate columns may not give intended result still, as it may not operate on the proper column.

Basically, if you do:

dt[, lapply(.SD, `*`, 2), .SDcols=c("a", "a")]
#     a  a
# 1:  4  4
# 2: 10 10
# 3:  6  6
# 4: 12 12

This still gives the unintended result, because it's hard to tell which "a" you mean each time, so the first one is always chosen.

But if you specify the columns unambiguously by position (as you do in your question):

dt[, lapply(.SD, `*`, 2), .SDcols=which( names(dt) %in% "a" )]
#     a  a
# 1:  4  8
# 2: 10  6
# 3:  6 12
# 4: 12  6
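
If you would rather select by name than by position, one possible workaround (not part of the NEWS entry above, just a sketch) is to make the names unique first with base R's make.unique() and then pass the resulting names to .SDcols. The table below is built by hand with the same values as the example in the question, so it does not depend on the random seed. The names dt_unique and a_cols are only illustrative.

```r
library(data.table)

# Same values as the example table in the question, built explicitly
dt <- data.table(V1 = c(2, 5, 3, 6), V2 = c(6, 1, 4, 6),
                 V3 = c(4, 3, 6, 3), V4 = c(5, 4, 1, 6))
setnames(dt, rep(letters[1:2], 2))                   # duplicated names: a b a b

# Work on a copy so the original table keeps its duplicated names
dt_unique <- copy(dt)
setnames(dt_unique, make.unique(names(dt_unique)))   # a b a.1 b.1

# Every column whose name starts with "a" is now distinguishable by name
a_cols <- grep("^a", names(dt_unique), value = TRUE)
res <- dt_unique[, lapply(.SD, `*`, 2), .SDcols = a_cols]
print(res)
```

The trade-off is that the result carries the de-duplicated names (a and a.1) rather than two columns both called a. If the original names matter, they can be restored afterwards with setnames().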
