Use of lapply .SD in data.table R

I am not very clear about the use of .SD and by.

For instance, does the snippet below mean "change all the columns in DT to factor except A and B"? The data.table manual also says: ".SD refers to the Subset of the data.table for each group (excluding the grouping columns)", so are columns A and B excluded?

DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)]

However, I have also read that by works like GROUP BY in SQL when you do aggregation. For instance, if I would like to sum (like colsum in SQL) over all the columns except A and B, do I still use something similar? Or, in this case, does the code below mean: take the sum, grouped by the values in columns A and B (as with GROUP BY A, B in SQL)?

DT[,lapply(.SD,sum),by=.(A,B)]

Then how do I do a simple colsum over all the columns except A and B?

Just to illustrate the comments above with an example, let's take:

library(data.table)
set.seed(10238)
# A and B are the "id" variables within which the
#   "data" variables C and D vary meaningfully
DT = data.table(
  A = rep(1:3, each = 5L), 
  B = rep(1:5, 3L),
  C = sample(15L),
  D = sample(15L)
)
DT
#     A B  C  D
#  1: 1 1 14 11
#  2: 1 2  3  8
#  3: 1 3 15  1
#  4: 1 4  1 14
#  5: 1 5  5  9
#  6: 2 1  7 13
#  7: 2 2  2 12
#  8: 2 3  8  6
#  9: 2 4  9 15
# 10: 2 5  4  3
# 11: 3 1  6  5
# 12: 3 2 12 10
# 13: 3 3 10  4
# 14: 3 4 13  7
# 15: 3 5 11  2

Compare the following:

#Sum all columns
DT[ , lapply(.SD, sum)]
#     A  B   C   D
# 1: 30 45 120 120

#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
#    A  B  C  D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28

#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
#     B   C   D
# 1: 45 120 120

#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
#    B  C  D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14
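
The question asked specifically about by = .(A, B), so here is a minimal sketch of that case. It uses a small hypothetical table, DT2 (not the DT above), so the sums can be checked by hand; the names DT2, sd_cols, grouped and totals are made up for illustration.

```r
library(data.table)

# Hypothetical toy table, chosen so the sums are easy to verify by hand
DT2 = data.table(
  A = c(1L, 1L, 2L, 2L),
  B = c(1L, 1L, 1L, 2L),
  C = 1:4,
  D = 5:8
)

# Which columns does .SD hold when grouping by A and B?
# Only the non-grouping columns, here C and D.
sd_cols = DT2[ , .(col = names(.SD)), by = .(A, B)][ , unique(col)]
sd_cols

# Grouped sums: one row per distinct (A, B) pair,
# with C and D summed within each pair (like GROUP BY A, B in SQL)
grouped = DT2[ , lapply(.SD, sum), by = .(A, B)]
grouped

# Plain column sums over everything except A and B:
# no `by` at all, the exclusion is done with .SDcols
totals = DT2[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
totals
```

In other words, by decides the grouping (and removes the grouping columns from .SD as a side effect), while .SDcols is the tool for simply leaving columns out.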

A few notes:

  • You said "does the below snippet... change all the columns in DT..."

The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.

  • You mentioned wanting to change the column types

Referring to the point above again, note that your code (DT[, lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One (incorrect) way to do this, which is how it is done with data.frames in base, is to overwrite the old data.table with the new data.table you've returned, i.e., DT = DT[, lapply(.SD, as.factor)].

This is wasteful because it involves creating copies of DT, which can be an efficiency killer when DT is large. The correct data.table approach to this problem is to update the columns by reference using `:=`, e.g., DT[, names(DT) := lapply(.SD, as.factor)], which creates no copies of your data. See data.table's reference semantics vignette for more on this.
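
As a rough sketch of the two points above, the snippet below uses a hypothetical toy table (the values, and the helper names res, cols, addr_before and addr_after, are made up for illustration). It assumes the intent in the question was to convert every column except A and B, which is expressed here with .SDcols rather than by.

```r
library(data.table)

# Hypothetical toy table; A and B play the role of id columns
DT = data.table(
  A = c(1L, 1L, 2L),
  B = c(1L, 2L, 1L),
  C = c(10L, 20L, 30L),
  D = c(3L, 2L, 1L)
)

# 1) lapply(.SD, ...) in j returns a NEW data.table; DT itself is untouched
res = DT[ , lapply(.SD, as.factor)]
class(res$C)  # "factor"
class(DT$C)   # "integer"

# 2) Update by reference with `:=`, restricted to the non-id columns
addr_before = address(DT)
cols = setdiff(names(DT), c("A", "B"))
DT[ , (cols) := lapply(.SD, as.factor), .SDcols = cols]
addr_after = address(DT)

# A and B keep their type; C and D are now factors
sapply(DT, class)

# DT was modified in place rather than reassigned
identical(addr_before, addr_after)
```

The parentheses around cols on the left-hand side of `:=` tell data.table to use the contents of the variable as column names instead of creating a column literally called cols.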

  • You mentioned comparing the efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table (you can see this in the output produced by adding the verbose = TRUE argument within []); to see this in action, let's beef up your DT a bit and run a benchmark:

Results:

library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]

library(microbenchmark)
microbenchmark(
  times = 100L,
  colsums = colSums(DT[ , !c("A", "B")]),
  lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
# Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#  colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962   100
#  lapplys  246.5824  250.3753  252.9603  252.1586  254.8297  266.1771   100
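
For the verbose = TRUE check mentioned above, a minimal sketch on a hypothetical small table (column names g, x and y are made up) could look like the following. The exact wording of the verbose messages depends on the data.table version, so no output is reproduced here; look for lines describing how j was rewritten or optimized.

```r
library(data.table)

# Hypothetical small table with a few groups
set.seed(1)
DT = data.table(g = rep(1:3, each = 4L), x = sample(12L), y = sample(12L))

# verbose = TRUE prints data.table's internal steps, including any
# optimization applied to j, in addition to returning the usual result
res_verbose = DT[ , lapply(.SD, sum), by = g, verbose = TRUE]

# The returned result is the same as without verbose
res_plain = DT[ , lapply(.SD, sum), by = g]
all.equal(res_verbose, res_plain)
```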
