简体   繁体   中英

Assignment by group in R data.table

Consider the following data:

data <- data.table(ID = 1:10, x = c(1,1,0,0,0,1,1,0,0,1), y = c(0,0,1,1,1,1,1,0,1,1))

I want to assign the following value in a column new : For each pair of x and y (eg x = 1 and y = 1), find the row highest up in the list where this specific pair occurs and let new be the ID of this row. For example, for all rows where x = 1 and y = 1, I want to new to be 6.

The following line of code seems to do precisely that:

data[, new := head(.SD, 1), .(x, y)]

My question is just, why does this work? head(.SD, 1) will be a list containing one row, how can R know I want to assign specifically the value in the first column of head(.SD, 1) to new ? I was expecting an error when trying to run this code, but I do actually get the desired output.

This is a very interesting observation.

First, I want to clarify that when you do by = .(x, y) , your .SD consists of only one column ( ID ). You can test that by comparing these two lines of code below.

This first line asks for the first column of .SD , and it works:

data[, head(.SD[, 1]), .(x, y)]
    x y ID
 1: 1 0  1
 2: 1 0  2
 3: 0 1  3
 4: 0 1  4
 5: 0 1  5
 6: 0 1  9
 7: 1 1  6
 8: 1 1  7
 9: 1 1 10
10: 0 0  8

But the second line below asks for the second column of .SD and it gets an error:

data[, head(.SD[, 2]), .(x, y)]
Error in `[.data.table`(.SD, , 2) : 
  Item 1 of j is 2 which is outside the column number range [1,ncol=1]

You see, .SD has only one column ( ID ). It does not contain the columns in by . That's why your code works as expected.

However, your observation is still valid. Consider the expanded data.table with two columns beside x and y .

data <- data.table(ID1 = 1:10, 
                   ID2 = letters[1:10],
                   x = c(1,1,0,0,0,1,1,0,0,1),
                   y = c(0,0,1,1,1,1,1,0,1,1))
data[, new := head(.SD, 1), .(x, y)]
data
    ID1 ID2 x y new
 1:   1   a 1 0   1
 2:   2   b 1 0   1
 3:   3   c 0 1   3
 4:   4   d 0 1   3
 5:   5   e 0 1   3
 6:   6   f 1 1   6
 7:   7   g 1 1   6
 8:   8   h 0 0   8
 9:   9   i 0 1   3
10:  10   j 1 1   6

new takes ID1 only. Why? According to the help page of := , when there is a type mismatch between the LHS and the RHS of := , "the RHS is coerced to match type of the LHS" . In this case the RHS is a list of 3, and the LHS is a list of 1, so only the first element is taken.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM