Assignment by group in R data.table
Consider the following data:
library(data.table)
data <- data.table(ID = 1:10, x = c(1,1,0,0,0,1,1,0,0,1), y = c(0,0,1,1,1,1,1,0,1,1))
I want to assign the following value to a column named new: for each pair of x and y (e.g. x = 1 and y = 1), find the topmost row where this specific pair occurs, and let new be the ID of that row. For example, for all rows where x = 1 and y = 1, I want new to be 6.
The following line of code seems to do precisely that:
data[, new := head(.SD, 1), .(x, y)]
My question is simply: why does this work? head(.SD, 1) will be a list containing one row, so how can R know that I want to assign specifically the value in the first column of head(.SD, 1) to new? I was expecting an error when running this code, but I actually get the desired output.
This is a very interesting observation.
First, I want to clarify that when you group with by = .(x, y), your .SD consists of only one column (ID). You can verify this by comparing the two lines of code below.
The first line asks for the first column of .SD, and it works:
data[, head(.SD[, 1]), .(x, y)]
    x y ID
 1: 1 0  1
 2: 1 0  2
 3: 0 1  3
 4: 0 1  4
 5: 0 1  5
 6: 0 1  9
 7: 1 1  6
 8: 1 1  7
 9: 1 1 10
10: 0 0  8
But the second line asks for the second column of .SD, and it throws an error:
data[, head(.SD[, 2]), .(x, y)]
Error in `[.data.table`(.SD, , 2) :
Item 1 of j is 2 which is outside the column number range [1,ncol=1]
You see, .SD has only one column (ID). It does not contain the columns used in by. That's why your code works as expected.
However, your observation is still valid. Consider an expanded data.table with two columns besides x and y:
data <- data.table(ID1 = 1:10,
                   ID2 = letters[1:10],
                   x = c(1,1,0,0,0,1,1,0,0,1),
                   y = c(0,0,1,1,1,1,1,0,1,1))
data[, new := head(.SD, 1), .(x, y)]
data
    ID1 ID2 x y new
 1:   1   a 1 0   1
 2:   2   b 1 0   1
 3:   3   c 0 1   3
 4:   4   d 0 1   3
 5:   5   e 0 1   3
 6:   6   f 1 1   6
 7:   7   g 1 1   6
 8:   8   h 0 0   8
 9:   9   i 0 1   3
10:  10   j 1 1   6
new takes its values from ID1 only. Why? According to the help page of :=, when there is a type mismatch between the LHS and the RHS of :=, "the RHS is coerced to match type of the LHS". In this case the RHS is a list of 2 (ID1 and ID2, since .SD excludes the by columns), and the LHS is a list of 1, so only the first element is taken.
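If the goal is simply the first ID1 in each group, it may be safer not to rely on how := treats a multi-column RHS. Here is a minimal sketch of two more explicit spellings, assuming the expanded data above; the column names new_a and new_b are illustrative only.

```r
library(data.table)

data <- data.table(ID1 = 1:10,
                   ID2 = letters[1:10],
                   x = c(1,1,0,0,0,1,1,0,0,1),
                   y = c(0,0,1,1,1,1,1,0,1,1))

# Option 1: name the column explicitly and take its first value per group.
data[, new_a := ID1[1L], by = .(x, y)]

# Option 2: restrict .SD to ID1 so that head(.SD, 1) has exactly one column.
data[, new_b := head(.SD, 1), by = .(x, y), .SDcols = "ID1"]

print(data)
```

Both options state which column supplies the value, so the result does not change if more columns are added to the table later.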