简体   繁体   中英

R Find Distance Between Two values By Group

HAVE = data.frame(INSTRUCTOR = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
STUDENT = c(1, 2, 2, 2, 1, 3, 1, 1, 1, 1, 2, 1),
SCORE = c(10, 1, 0, 0, 7, 3, 5, 2, 2, 4, 10, 2),
TIME = c(1,1,2,3,2,1,1,2,3,1,1,2))
 
WANT = data.frame(INSTRUCTOR = c(1, 2, 3), 
SCORE.DIF = c(-9, NA, 6))

For each INSTRUCTOR, I wish to find the SCORE of the first and second STUDENT, and subtract their scores. The STUDENT code varies so I wish not to use '==1' vs '==2'

I try:

HAVE[, .SD[1:2], by = 'INSTRUCTOR']

but do not know how to subtract vertically and obtain 'WANT' data frame from 'HAVE'

library(data.table)
setDT(HAVE)
unique(HAVE, by = c("INSTRUCTOR", "STUDENT")
  )[, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
#    INSTRUCTOR SCORE.DIF
#         <num>     <num>
# 1:          1        -9
# 2:          2        NA
# 3:          3         6

To use your new TIME variable, we can do

HAVE[, .SD[which.min(TIME),], by = .(INSTRUCTOR, STUDENT)
  ][, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
#    INSTRUCTOR SCORE.DIF
#         <num>     <num>
# 1:          1        -9
# 2:          2        NA
# 3:          3         6

One might be tempted to replace SCORE[1:2] with head(SCORE,2) , but that won't work: head(SCORE,2) will return length-1 if the input is length-2, as it is with instructor 2 (who only has one student albeit multiple times). When you run diff on length-1 (eg, diff(1) ), it returns a 0-length vector, which in the above data.table code reduces to zero rows for instructor 2. However , when there is only one student, SCORE[1:2] resolves to c(SCORE[1], NA) , for which the diff is length-1 (as needed) and NA (as needed).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM