How to apply a function to subsets of a data.table using by and expose all columns to the function?

Question

When slicing a data.table by group(s), the variables used to slice the data are not in the subset during the function execution. I demonstrate this using debugonce().

library(data.table)
x <- data.table(a = rep(letters[1:4], each = 3), b = rep(c("a", "b"), each = 6), c = rnorm(12))

myfun <- function(y) paste(y$a, y$b, y$c, collapse = "")

> debugonce(myfun)
> x[, myfun(.SD), by = .(b, a)]
debugging in: myfun(.SD)
debug: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
            c
1: -1.2662416
2:  0.9818497
3: -0.5395385

What I'm after is the functionality of the split-sapply paradigm, where I would slice a data.frame according to factor(s) and apply the function to all columns, that is, also including the variables which have been used to slice it (demonstrated below).

> debugonce(myfun)

> sapply(split(x, f = list(x$b, x$a)), FUN = myfun)
debugging in: FUN(X[[i]], ...)
debug: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
   a b          c
1: a a -1.2662416
2: a a  0.9818497
3: a a -0.5395385

Answer 1

The OP has a function which takes a list as its argument. That list should contain all columns of the data.table, including the columns used for grouping in by.

According to help(".SD"):

.SD is a data.table containing the Subset of x's Data for each group, *excluding* any columns used in by (or keyby).

(emphasis mine)

.BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance.

So, .BY and .SD complement each other: together they give access to all columns of the data.table.
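The remark that .BY helps when by is not known in advance can be illustrated with a minimal sketch. The names dt and label_groups() are hypothetical and not part of the question; the grouping columns arrive as a character vector, so the function body cannot refer to them by name and reads them from .BY instead.

```r
library(data.table)

# Toy data for illustration only (not the OP's data)
dt <- data.table(a = rep(letters[1:4], each = 3),
                 b = rep(c("a", "b"), each = 6),
                 c = 1:12)

# Hypothetical helper: by_cols is a character vector of grouping columns.
# .BY holds one length-1 value per grouping column for the current group.
label_groups <- function(d, by_cols) {
  d[, .(label = paste(unlist(.BY), collapse = "-"), n = .N), by = by_cols]
}

res <- label_groups(dt, c("b", "a"))
print(res)
```

Each row of res carries the grouping values, a label built from .BY, and the group size.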

Instead of explicitly repeating the by columns in the function call

x[, myfun(c(list(b = b, a = a), .SD)), by = .(b, a)]  # list elements must be named so that y$b and y$a resolve

we can use

x[, myfun(c(.BY, .SD)), by = .(b, a)]

   b a                                                                 V1
1: a a    a a -1.02091215130492a a -0.295107569536843a a 0.77776326093429
2: a b b a -0.369037832486311b a -0.716211663822323b a -0.264799143319049
3: b c      c b -1.39603530693486c b 1.4707902839894c b 0.721925347069227
4: b d   d b -1.15220308230505d b -0.736782242593426d b 0.420986999145651

The OP has used debugonce() to show the argument passed to myfun():

> debugonce(myfun)
> x[, myfun(c(.BY, .SD)), by = .(b, a)]
debugging in: myfun(c(.BY, .SD))
debug at #1: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
$b
[1] "a"

$a
[1] "a"

$c
[1] -1.0209122 -0.2951076  0.7777633

Another example

With another sample data set and function, it might be easier to illustrate the core of the question:

x <- data.table(a = rep(letters[3:6], each = 3), b = rep(c("x", "y"), each = 6), c = 1:12)
myfun <- function(y) paste(y$a, y$b, y$c, sep = "/", collapse = "-")

x[, myfun(.SD), by = .(b, a)]

   b a             V1
1: x c    //1-//2-//3
2: x d    //4-//5-//6
3: y e    //7-//8-//9
4: y f //10-//11-//12

So, columns b and a do appear in the output as grouping variables, but they aren't passed via .SD to the function.

Now, with .BY complementing .SD

x[, myfun(c(.BY, .SD)), by = .(b, a)]

   b a                   V1
1: x c    c/x/1-c/x/2-c/x/3
2: x d    d/x/4-d/x/5-d/x/6
3: y e    e/y/7-e/y/8-e/y/9
4: y f f/y/10-f/y/11-f/y/12

all columns of the data.table are passed to the function.
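Note that c(.BY, .SD) is a plain list whose grouping entries have length 1. If the function needs a full data.table per group, as in the split-sapply version from the question, one possible sketch is to wrap the list in as.data.table(), which recycles the length-1 grouping values to the group size. The names dt and myfun_dt() are hypothetical and only serve this illustration.

```r
library(data.table)

# Same shape as the second sample data set, under a separate name
dt <- data.table(a = rep(letters[3:6], each = 3),
                 b = rep(c("x", "y"), each = 6),
                 c = 1:12)

# Hypothetical function that insists on receiving a data.table
# containing the grouping columns as well
myfun_dt <- function(y) {
  stopifnot(is.data.table(y), all(c("a", "b", "c") %in% names(y)))
  paste(y$a, y$b, y$c, sep = "/", collapse = "-")
}

# as.data.table() turns the list into a data.table with one row per
# observation in the group; the length-1 .BY values are recycled
res_dt <- dt[, myfun_dt(as.data.table(c(.BY, .SD))), by = .(b, a)]
print(res_dt)
```

This rebuilds a small data.table for every group, so it is likely slower than passing the list directly. It is only worth doing when the function really depends on data.table or data.frame behaviour.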

Separate arguments in the function call

Roland has suggested passing .BY and .SD as separate parameters to the function. Indeed, .BY is a list object and .SD is a data.table object (which is essentially also a list, which is why c(.BY, .SD) works). There might be cases where the difference matters.

To verify, we can define a function which prints str() as a side effect. The function is only called for the first group (.GRP == 1L).

myfun1 <- function(y) str(y)
x[, if (.GRP == 1L) myfun1(.SD), by = .(b, a)]

Classes 'data.table' and 'data.frame':	3 obs. of  1 variable:
 $ c: int  1 2 3
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, ".data.table.locked")= logi TRUE
Empty data.table (0 rows) of 2 cols: b,a

x[, if (.GRP == 1L) myfun1(.BY), by = .(b, a)]

List of 2
 $ b: chr "x"
 $ a: chr "c"
Empty data.table (0 rows) of 2 cols: b,a

x[, if (.GRP == 1L) myfun1(c(.BY, .SD)), by = .(b, a)]

List of 3
 $ b: chr "x"
 $ a: chr "c"
 $ c: int [1:3] 1 2 3
Empty data.table (0 rows) of 2 cols: b,a
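Roland's suggestion of separate arguments could look roughly like the sketch below. The function myfun2() and its parameter names are hypothetical. It receives .BY as a plain list and .SD as a data.table, so each keeps its own class.

```r
library(data.table)

# Same shape as the second sample data set, under a separate name
dt <- data.table(a = rep(letters[3:6], each = 3),
                 b = rep(c("x", "y"), each = 6),
                 c = 1:12)

# Hypothetical two-argument variant of myfun():
# by_vals receives .BY (a list), sd receives .SD (a data.table)
myfun2 <- function(by_vals, sd) {
  paste(by_vals$a, by_vals$b, sd$c, sep = "/", collapse = "-")
}

res2 <- dt[, myfun2(.BY, .SD), by = .(b, a)]
print(res2)
```

With this data, the V1 column should match the result of the c(.BY, .SD) call above, while the function can still use data.table methods on sd.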

Additional links

Besides help(".SD"), the comments and answers to the following SO questions might be useful:

Answer 1
15, accepted, 2017-07-25 06:38:37
