Assigning by group in R data.table

Question

Consider the following data:

library(data.table)

data <- data.table(ID = 1:10, x = c(1,1,0,0,0,1,1,0,0,1), y = c(0,0,1,1,1,1,1,0,1,1))

I want to assign the following value to a column new: for each pair of x and y (e.g. x = 1 and y = 1), find the first row in the table where this specific pair occurs and let new be the ID of that row. For example, for all rows where x = 1 and y = 1, I want new to be 6.

The following line of code seems to do precisely that:

data[, new := head(.SD, 1), .(x, y)]

My question is just: why does this work? head(.SD, 1) will be a list containing one row, so how can R know that I want to assign specifically the value in the first column of head(.SD, 1) to new? I was expecting an error when running this code, but I do actually get the desired output.

Answer 1

This is a very interesting observation.

First, I want to clarify that when you group with by = .(x, y), your .SD consists of only one column (ID). You can test that by comparing the two lines of code below.

This first line asks for the first column of .SD, and it works:

data[, head(.SD[, 1]), .(x, y)]

But the second line below asks for the second column of .SD, and it gets an error:

data[, head(.SD[, 2]), .(x, y)]

Error in `[.data.table`(.SD, , 2) : 
  Item 1 of j is 2 which is outside the column number range [1,ncol=1]

You see, .SD has only one column (ID). It does not contain the columns in by. That's why your code works as expected.
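A quick way to check this directly is to list the column names of .SD within each group. This is only a minimal sketch using the data from the question; the result name cols and the output column name sd_cols are made up for illustration.

```r
library(data.table)

data <- data.table(ID = 1:10,
                   x = c(1,1,0,0,0,1,1,0,0,1),
                   y = c(0,0,1,1,1,1,1,0,1,1))

# names(.SD) returns the columns that .SD holds for each group;
# the grouping columns x and y should not appear among them
cols <- data[, .(sd_cols = names(.SD)), by = .(x, y)]
print(cols)
```

Each of the four (x, y) groups should report ID as the only column in .SD.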

However, your observation is still valid. Consider an expanded data.table with two columns besides x and y.

data <- data.table(ID1 = 1:10, 
                   ID2 = letters[1:10],
                   x = c(1,1,0,0,0,1,1,0,0,1),
                   y = c(0,0,1,1,1,1,1,0,1,1))
data[, new := head(.SD, 1), .(x, y)]
data

    ID1 ID2 x y new
 1:   1   a 1 0   1
 2:   2   b 1 0   1
 3:   3   c 0 1   3
 4:   4   d 0 1   3
 5:   5   e 0 1   3
 6:   6   f 1 1   6
 7:   7   g 1 1   6
 8:   8   h 0 0   8
 9:   9   i 0 1   3
10:  10   j 1 1   6

new takes ID1 only. Why? According to the help page of :=, when there is a type mismatch between the LHS and the RHS of :=, "the RHS is coerced to match type of the LHS". In this case the RHS is a list of 2 (ID1 and ID2, since the by columns x and y are not part of .SD), and the LHS is a list of 1, so only the first element is taken.
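Since the result depends on which column happens to come first in .SD, it may be safer to name the source column explicitly. The following is a sketch of two ways to do that on the expanded data, not the only option; the column name new2 is introduced here purely for comparison.

```r
library(data.table)

data <- data.table(ID1 = 1:10,
                   ID2 = letters[1:10],
                   x = c(1,1,0,0,0,1,1,0,0,1),
                   y = c(0,0,1,1,1,1,1,0,1,1))

# Take the first ID1 within each (x, y) group, naming the column directly
data[, new := ID1[1L], by = .(x, y)]

# Same idea via .SD, restricted to ID1 with .SDcols so column order cannot matter
data[, new2 := .SD[[1L]][1L], by = .(x, y), .SDcols = "ID1"]

print(data)
```

Both new and new2 should match the new column in the output above, without relying on ID1 being the first non-grouping column.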

Answer 1 (3, accepted, 2020-03-28 03:51:28)