Use of lapply .SD in data.table R
I am not very clear about the use of `.SD` and `by`.
For instance, does the snippet below mean 'change all the columns in `DT` to factor except `A` and `B`'? The `data.table` manual also says: "`.SD` refers to the Subset of the data.table for each group (excluding the grouping columns)", so are columns `A` and `B` excluded?
DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)]
However, I also read that `by` works like 'group by' in SQL when you do aggregation. For instance, if I would like to sum (like colsum in SQL) over all the columns except `A` and `B`, do I still use something similar? Or in this case, does the code below mean to take the sum and group by the values in columns `A` and `B` (take the sum and group by `A,B` as in SQL)?
DT[,lapply(.SD,sum),by=.(A,B)]
Then how do I do a simple colsum over all the columns except `A` and `B`?
Just to illustrate the comments above with an example, let's take:
library(data.table)
set.seed(10238)
# A and B are the "id" variables within which the
# "data" variables C and D vary meaningfully
DT = data.table(
A = rep(1:3, each = 5L),
B = rep(1:5, 3L),
C = sample(15L),
D = sample(15L)
)
DT
# A B C D
# 1: 1 1 14 11
# 2: 1 2 3 8
# 3: 1 3 15 1
# 4: 1 4 1 14
# 5: 1 5 5 9
# 6: 2 1 7 13
# 7: 2 2 2 12
# 8: 2 3 8 6
# 9: 2 4 9 15
# 10: 2 5 4 3
# 11: 3 1 6 5
# 12: 3 2 12 10
# 13: 3 3 10 4
# 14: 3 4 13 7
# 15: 3 5 11 2
Compare the following:
#Sum all columns
DT[ , lapply(.SD, sum)]
# A B C D
# 1: 30 45 120 120
#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
# A B C D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28
#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
# B C D
# 1: 45 120 120
#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
# B C D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14
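To see directly which columns `.SD` holds in each case, one option is to return `names(.SD)` from `j`. This is only a minimal sketch on a small stand-in table; `dt` and `sd_cols` are illustrative names, not part of the question's code:

```r
library(data.table)

# Small stand-in table with the same column layout as DT above
dt = data.table(A = rep(1:3, each = 5L), B = rep(1:5, 3L), C = 1:15, D = 15:1)

# No grouping: .SD holds every column
dt[ , .(sd_cols = paste(names(.SD), collapse = ","))]

# Grouping by A and B: .SD holds only the non-grouping columns
dt[ , .(sd_cols = paste(names(.SD), collapse = ",")), by = .(A, B)]
```

So the grouping columns are left out of `.SD`, but they still appear in the result as the group identifiers.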
A few notes:
You said "does the below snippet... change all the columns in `DT`..." The answer is no, and this is very important for `data.table`. The object returned is a new `data.table`, and all of the columns in `DT` are exactly as they were before running the code.
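A quick way to convince yourself of this is to compare column classes before and after. The sketch below uses a small stand-in table and, to keep it simple, converts only `C` and `D` without grouping; `dt` and `res` are illustrative names:

```r
library(data.table)

dt = data.table(A = rep(1:3, each = 5L), B = rep(1:5, 3L), C = 1:15, D = 15:1)

# The query returns a new object; store it so it can be inspected
res = dt[ , lapply(.SD, as.factor), .SDcols = c("C", "D")]

sapply(res, class)  # the returned table has factor columns
sapply(dt, class)   # the original table still has integer columns
```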
Referring to the point above again, note that your code (`DT[, lapply(.SD, as.factor)]`) returns a new `data.table` and does not change `DT` at all. One (incorrect) way to do this, which is how it is done with `data.frame`s in `base`, is to overwrite the old `data.table` with the new `data.table` you've returned, i.e., `DT = DT[, lapply(.SD, as.factor)]`.
This is wasteful because it involves creating copies of `DT`, which can be an efficiency killer when `DT` is large. The correct `data.table` approach to this problem is to update the columns by reference using `:=`, e.g., `DT[, names(DT) := lapply(.SD, as.factor)]`, which creates no copies of your data. See `data.table`'s reference semantics vignette for more on this.
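If the goal is to convert everything except `A` and `B`, the same idea can be combined with `.SDcols`. This is a sketch under the assumption that `A` and `B` are the only columns to leave alone; `dt`, `cols`, and `addr_before` are illustrative names, and `address()` is used only to show that the table is the same object afterwards:

```r
library(data.table)

dt = data.table(A = rep(1:3, each = 5L), B = rep(1:5, 3L), C = 1:15, D = 15:1)

# Columns to convert: everything except the id columns A and B
cols = setdiff(names(dt), c("A", "B"))

addr_before = address(dt)

# Parentheses around cols tell := to use the names stored in cols
dt[ , (cols) := lapply(.SD, as.factor), .SDcols = cols]

sapply(dt, class)                    # A and B stay integer; C and D are factors
identical(addr_before, address(dt))  # dt was updated in place, not replaced
```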
You mentioned comparing the efficiency of `lapply(.SD, sum)` to that of `colSums`. `sum` is internally optimized in `data.table` (you can confirm this from the output produced by adding the `verbose = TRUE` argument within `[]`); to see this in action, let's beef up your `DT` a bit and run a benchmark:
library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]
library(microbenchmark)
microbenchmark(
times = 100L,
colsums = colSums(DT[ , !c("A", "B")]),
lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
Results:
# Unit: milliseconds
# expr min lq mean median uq max neval
# colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962 100
# lapplys 246.5824 250.3753 252.9603 252.1586 254.8297 266.1771 100
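Before trusting a timing comparison, it can help to check on a small table that both approaches give the same sums, and to look at the `verbose = TRUE` diagnostics yourself. The following is only a sketch; `small`, `by_colsums`, and `by_lapply` are illustrative names, the seed is arbitrary, and the exact verbose text depends on your `data.table` version:

```r
library(data.table)

set.seed(1)  # arbitrary seed for this illustration
small = data.table(
  A = sample(3L, 1000L, TRUE),
  B = sample(3L, 1000L, TRUE),
  C = sample(100L, 1000L, TRUE),
  D = sample(100L, 1000L, TRUE)
)

by_colsums = colSums(small[ , !c("A", "B")])
by_lapply  = small[ , lapply(.SD, sum), .SDcols = !c("A", "B")]

# Compare values only: colSums returns doubles, sum() on integers returns integers
all.equal(unname(by_colsums), as.numeric(unlist(by_lapply)))

# verbose = TRUE prints diagnostics about how data.table evaluated j
small[ , lapply(.SD, sum), .SDcols = !c("A", "B"), verbose = TRUE]
```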