How to apply a function to a subset of data.table using by and exposing all columns to the function?
When slicing a data.table by group(s), the variables used to slice the data are not in the subset during function execution. I demonstrate this using debugonce.
library(data.table)
x <- data.table(a = rep(letters[1:4], each = 3), b = rep(c("a", "b"), each = 6), c = rnorm(12))
myfun <- function(y) paste(y$a, y$b, y$c, collapse = "")
> debugonce(myfun)
> x[, myfun(.SD), by = .(b, a)]
debugging in: myfun(.SD)
debug: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
c
1: -1.2662416
2: 0.9818497
3: -0.5395385
What I'm after is the functionality of the split-sapply paradigm, where I slice a data.frame according to factor(s) and apply the function to all columns, including the variables that were used to slice it (demonstrated below).
> debugonce(myfun)
> sapply(split(x, f = list(x$b, x$a)), FUN = myfun)
debugging in: FUN(X[[i]], ...)
debug: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
a b c
1: a a -1.2662416
2: a a 0.9818497
3: a a -0.5395385
The OP has a function that takes a list as its argument. That list should contain all columns of the data.table, including the columns used for grouping in by.
According to help(".SD"):
.SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by (or keyby).
(emphasis mine)
.BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance.
So, .BY and .SD complement each other and together give access to all columns of the data.table.
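A quick way to see this split of columns is to report the names of both objects per group. This is only a minimal sketch; the column names by_cols and sd_cols are made up for illustration, and the sample data uses fixed values instead of rnorm() so the result is reproducible.

```r
library(data.table)

x <- data.table(a = rep(letters[3:6], each = 3),
                b = rep(c("x", "y"), each = 6),
                c = 1:12)

# For every group, list which columns arrive via .BY and which via .SD
cols <- x[, .(by_cols = paste(names(.BY), collapse = ","),
              sd_cols = paste(names(.SD), collapse = ",")),
          by = .(b, a)]
print(cols)
```

Every group should report b,a for .BY and c for .SD, so neither object alone covers all three columns.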
Instead of explicitly repeating the by columns in the function call
x[, myfun(c(list(b, a), .SD)), by = .(b, a)]
we can use
x[, myfun(c(.BY, .SD)), by = .(b, a)]
   b a                                                                 V1
1: a a    a a -1.02091215130492a a -0.295107569536843a a 0.77776326093429
2: a b b a -0.369037832486311b a -0.716211663822323b a -0.264799143319049
3: b c      c b -1.39603530693486c b 1.4707902839894c b 0.721925347069227
4: b d   d b -1.15220308230505d b -0.736782242593426d b 0.420986999145651
The OP used debugonce() to show the argument passed to myfun():
> debugonce(myfun)
> x[, myfun(c(.BY, .SD)), by = .(b, a)]
debugging in: myfun(c(.BY, .SD))
debug at #1: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
$b
[1] "a"
$a
[1] "a"
$c
[1] -1.0209122 -0.2951076 0.7777633
With another sample data set and function, it might be easier to exemplify the core of the question:
x <- data.table(a = rep(letters[3:6], each = 3), b = rep(c("x", "y"), each = 6), c = 1:12)
myfun <- function(y) paste(y$a, y$b, y$c, sep = "/", collapse = "-")
x[, myfun(.SD), by = .(b, a)]
   b a             V1
1: x c    //1-//2-//3
2: x d    //4-//5-//6
3: y e    //7-//8-//9
4: y f //10-//11-//12
So, columns b and a do appear in the output as grouping variables, but they aren't passed to the function via .SD.
Now, with .BY complementing .SD
x[, myfun(c(.BY, .SD)), by = .(b, a)]
   b a                   V1
1: x c    c/x/1-c/x/2-c/x/3
2: x d    d/x/4-d/x/5-d/x/6
3: y e    e/y/7-e/y/8-e/y/9
4: y f f/y/10-f/y/11-f/y/12
all columns of the data.table are passed to the function.
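To check that this matches the split-sapply paradigm from the question, the two results can be compared directly. The following is a sketch under two assumptions that are not in the original posts: drop = TRUE is added to split() so that empty combinations of b and a are discarded, and the groups are matched by the "b.a" names that split() generates.

```r
library(data.table)

x <- data.table(a = rep(letters[3:6], each = 3),
                b = rep(c("x", "y"), each = 6),
                c = 1:12)
myfun <- function(y) paste(y$a, y$b, y$c, sep = "/", collapse = "-")

# data.table approach: grouping columns come from .BY, the rest from .SD
res_dt <- x[, myfun(c(.BY, .SD)), by = .(b, a)]

# split-sapply approach; drop = TRUE (an assumption) removes empty groups
res_split <- sapply(split(x, f = list(x$b, x$a), drop = TRUE), FUN = myfun)

# Match groups by name ("x.c", "x.d", ...) instead of relying on their order
key <- res_dt[, paste(b, a, sep = ".")]
same <- identical(res_dt$V1, unname(res_split[key]))
print(same)
```

If same is TRUE, both approaches hand the function the same values for a, b and c in every group.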
Roland has suggested passing .BY and .SD as separate parameters to the function. Indeed, .BY is a list object and .SD is a data.table object (which is essentially also a list, which is why c(.BY, .SD) works). There might be cases where the difference matters.
To verify, we can define a function which prints str() as a side effect. The function is only called for the first group (.GRP == 1L).
myfun1 <- function(y) str(y)
x[, if (.GRP == 1L) myfun1(.SD), by = .(b, a)]
Classes 'data.table' and 'data.frame': 3 obs. of 1 variable:
 $ c: int 1 2 3
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, ".data.table.locked")= logi TRUE
Empty data.table (0 rows) of 2 cols: b,a
x[, if (.GRP == 1L) myfun1(.BY), by = .(b, a)]
List of 2
 $ b: chr "x"
 $ a: chr "c"
Empty data.table (0 rows) of 2 cols: b,a
x[, if (.GRP == 1L) myfun1(c(.BY, .SD)), by = .(b, a)]
List of 3
 $ b: chr "x"
 $ a: chr "c"
 $ c: int [1:3] 1 2 3
Empty data.table (0 rows) of 2 cols: b,a
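Since c(.BY, .SD) is a plain list, a function that expects a real data.table would not get one. Two possible ways around this are sketched below: passing .BY and .SD separately, as Roland suggested, or rebuilding a data.table with as.data.table(), which recycles the length 1 entries of .BY to the group size. The helpers myfun2 and describe are hypothetical and only serve as illustration.

```r
library(data.table)

x <- data.table(a = rep(letters[3:6], each = 3),
                b = rep(c("x", "y"), each = 6),
                c = 1:12)

# Option 1: two arguments; by is a plain list, sd is still a data.table
myfun2 <- function(by, sd) {
  stopifnot(is.list(by), is.data.table(sd))
  paste(by$a, by$b, sd$c, sep = "/", collapse = "-")
}
res_sep <- x[, myfun2(.BY, .SD), by = .(b, a)]

# Option 2: rebuild a data.table holding all columns for the group
describe <- function(dt) {
  stopifnot(is.data.table(dt))
  sprintf("%d rows, cols: %s", nrow(dt), paste(names(dt), collapse = ","))
}
res_dt <- x[, describe(as.data.table(c(.BY, .SD))), by = .(b, a)]

print(res_sep)
print(res_dt)
```

Option 1 avoids building a new object per group. Option 2 comes closest to what split() hands to the function, at the cost of one extra data.table per group.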
Besides help(".SD"), the comments and answers to the following SO questions might be useful: