
New data.table column returns nth largest value in a row

My data looks similar to this:

library(data.table)

set.seed(1)
dt <- data.table(rank=c(3,4,2,1),`1`=rnorm(4),`2`=rnorm(4),`3`=rnorm(4),`4`=rnorm(4),`5`=rnorm(4),`6`=rnorm(4))

   rank          1          2          3           4           5           6
1:    3 -0.6264538  0.3295078  0.5757814 -0.62124058 -0.01619026  0.91897737
2:    4  0.1836433 -0.8204684 -0.3053884 -2.21469989  0.94383621  0.78213630
3:    2 -0.8356286  0.4874291  1.5117812  1.12493092  0.82122120  0.07456498
4:    1  1.5952808  0.7383247  0.3898432 -0.04493361  0.59390132 -1.98935170

I would like to add a new column, rank_match, that holds the nth largest value in each row across the columns named 1 to 6, where n is taken from the rank column. For instance, the first row has rank 3, so it should return the 3rd largest value among columns 1 to 6, which is 0.3295078.

Something like these (but of course they don't work):

dt[,rank_match := (sort(`1`:`6`, decreasing = TRUE)[rank])]
dt[,rank_match := (sort(.SD, decreasing = TRUE)[rank]), .SDcols=`1`:`6`]

The output should look similar to this:

   rank          1          2          3           4           5           6 rank_match
1:    3 -0.6264538  0.3295078  0.5757814 -0.62124058 -0.01619026  0.91897737  0.3295078
2:    4  0.1836433 -0.8204684 -0.3053884 -2.21469989  0.94383621  0.78213630 -0.3053884
3:    2 -0.8356286  0.4874291  1.5117812  1.12493092  0.82122120  0.07456498  1.1249309
4:    1  1.5952808  0.7383247  0.3898432 -0.04493361  0.59390132 -1.98935170  1.5952808

Thanks so much.

One option is to group by row number, select the columns of interest (column 2 onwards) with .SDcols, unlist the Subset of Data.table (.SD), sort it in decreasing order, pick the element indexed by the 'rank' column, and assign it to 'rank_match':

dt[, rank_match := sort(unlist(.SD), decreasing = TRUE)[rank], 
           1:nrow(dt), .SDcols = 2:ncol(dt) ]
dt
#   rank          1          2          3           4           5           6 rank_match
#1:    3 -0.6264538  0.3295078  0.5757814 -0.62124058 -0.01619026  0.91897737  0.3295078
#2:    4  0.1836433 -0.8204684 -0.3053884 -2.21469989  0.94383621  0.78213630 -0.3053884
#3:    2 -0.8356286  0.4874291  1.5117812  1.12493092  0.82122120  0.07456498  1.1249309
#4:    1  1.5952808  0.7383247  0.3898432 -0.04493361  0.59390132 -1.98935170  1.5952808
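
To see what the grouped call does, here is a minimal sketch of the same idea on a small hand-made table. The table toy, its columns a, b, c and the column name nth_largest are made up for illustration; the values are chosen so the result can be checked by eye.

```r
library(data.table)

# Hypothetical toy data: 'rank' says which largest value to pick in each row
toy <- data.table(rank = c(2, 1, 3),
                  a = c(10, 5, 7),
                  b = c(30, 9, 8),
                  c = c(20, 1, 9))

# One group per row; .SD holds only the value columns named in .SDcols,
# while 'rank' is still visible in j as a length-1 vector for that row
toy[, nth_largest := sort(unlist(.SD), decreasing = TRUE)[rank],
    by = seq_len(nrow(toy)), .SDcols = c("a", "b", "c")]

toy$nth_largest  # 20 9 7
```

Row 1 sorts to 30, 20, 10 and rank 2 picks 20; row 2 sorts to 9, 5, 1 and rank 1 picks 9; row 3 sorts to 9, 8, 7 and rank 3 picks 7.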

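Another option is to melt to long format and then pick the corresponding element of the 'value' column within each original row. This needs a row-number column 'rn', which has to be created first:

dt[, rn := seq_len(.N)]  # row id, so that each original row forms its own group
out <- melt(dt, id.vars = c('rn', 'rank'))[order(-value), 
                  value[rank[1]] , .(rn)][order(rn)]$V1
dt[, rank_match := out][, rn := NULL][]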

Or a compact approach suggested by @IceCreamToucan

dt[, rank_match := melt(.SD, 'rank')[, value[order(-value)[rank]], rank]$V1]

Or use pmap (from purrr) to loop through the rows. Note the leading minus, which undoes the negation used for the decreasing sort:

library(purrr)
dt[, rank_match := pmap_dbl(.SD, ~ c(...) %>% 
                                    {-sort(-.[-1])[.[1]]})]

Apply a function to each row of .SD; inside it, x[1] is the rank and x[-1] holds the values:

dt[, rank_match := apply(.SD, 1, function(x) -sort(-x[-1])[x[1]])]

giving:

   rank          1          2          3           4           5           6 rank_match
1:    3 -0.6264538  0.3295078  0.5757814 -0.62124058 -0.01619026  0.91897737  0.3295078
2:    4  0.1836433 -0.8204684 -0.3053884 -2.21469989  0.94383621  0.78213630 -0.3053884
3:    2 -0.8356286  0.4874291  1.5117812  1.12493092  0.82122120  0.07456498  1.1249309
4:    1  1.5952808  0.7383247  0.3898432 -0.04493361  0.59390132 -1.98935170  1.5952808

Another option groups by rank and restricts .SD to the value columns:

dt[, rank_match := apply(.SD, 1, function(x) x[order(-x)][rank]), by = rank, .SDcols = `1`:`6`]
dt
   rank          1          2          3           4           5           6 rank_match
1:    3 -0.6264538  0.3295078  0.5757814 -0.62124058 -0.01619026  0.91897737  0.3295078
2:    4  0.1836433 -0.8204684 -0.3053884 -2.21469989  0.94383621  0.78213630 -0.3053884
3:    2 -0.8356286  0.4874291  1.5117812  1.12493092  0.82122120  0.07456498  1.1249309
4:    1  1.5952808  0.7383247  0.3898432 -0.04493361  0.59390132 -1.98935170  1.5952808
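
A possible variant of the apply idea that avoids grouping altogether: sort every row once, then pick one element per row with matrix indexing. This is only a sketch, assuming all value columns are numeric, contain no NA, and every rank lies between 1 and the number of value columns. The names value_cols, m and sorted are introduced here for illustration.

```r
library(data.table)

set.seed(1)
dt <- data.table(rank = c(3, 4, 2, 1),
                 `1` = rnorm(4), `2` = rnorm(4), `3` = rnorm(4),
                 `4` = rnorm(4), `5` = rnorm(4), `6` = rnorm(4))

value_cols <- as.character(1:6)
m <- as.matrix(dt[, ..value_cols])

# Sort each row in decreasing order; apply() returns one column per row,
# so transpose to get back one row per original row
sorted <- t(apply(m, 1, sort, decreasing = TRUE))

# A two-column (row, column) index matrix selects exactly one element per row
dt[, rank_match := sorted[cbind(seq_len(.N), rank)]]
dt
```

Because the lookup is positional, rows that share the same rank are handled without any extra row-id column.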

DescTools::Large returns the n largest elements of a vector without sorting the whole thing. I am not sure how this compares to dt[order(-value)[rank], ...].

library(DescTools)
library(data.table)

dt[, rank_match := melt(dt, 'rank')[, Large(value, rank)[1], rank]$V1]


#    rank          1          2          3           4           5           6 rank_match
# 1:    3 -0.6264538  0.3295078  0.5757814 -0.62124058 -0.01619026  0.91897737  0.3295078
# 2:    4  0.1836433 -0.8204684 -0.3053884 -2.21469989  0.94383621  0.78213630 -0.3053884
# 3:    2 -0.8356286  0.4874291  1.5117812  1.12493092  0.82122120  0.07456498  1.1249309
# 4:    1  1.5952808  0.7383247  0.3898432 -0.04493361  0.59390132 -1.98935170  1.5952808
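
Base R offers something in the same spirit: sort(x, partial = k) only guarantees that position k holds the value it would hold after a full ascending sort. A minimal sketch, where nth_largest is a hypothetical helper (not part of any package) and x is assumed to contain no NA, with 1 <= n <= length(x):

```r
# Hypothetical helper: nth largest value of x via partial sorting (base R only).
# Partial sorting works in ascending order, so translate "nth largest"
# into the matching ascending position first.
nth_largest <- function(x, n) {
  k <- length(x) - n + 1
  sort(x, partial = k)[k]
}

x <- c(5, 1, 9, 3, 7)
nth_largest(x, 1)  # 9
nth_largest(x, 2)  # 7
```

Whether this is faster than a full sort for rows of only six values is doubtful; it is more likely to matter for long vectors.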

Note: If some rows have the same rank you must use the rn / row number logic as in akrun's answer.
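
To illustrate the note, here is a sketch with a hypothetical table dup in which two rows share rank 2. It groups by an explicit row id instead of by rank; the names dup, long and res are made up for this example.

```r
library(data.table)

# Hypothetical data: rows 1 and 2 both ask for the 2nd largest value
dup <- data.table(rank = c(2, 2, 1),
                  a = c(10, 5, 7),
                  b = c(30, 9, 8),
                  c = c(20, 1, 9))

dup[, rn := seq_len(.N)]                      # explicit row id
long <- melt(dup, id.vars = c("rn", "rank"))  # one line per (row, column) pair

# Group by row id, not by rank, so rows with equal ranks stay separate
res <- long[, .(rank_match = sort(value, decreasing = TRUE)[rank[1]]), by = rn]

# Join the per-row result back by row id, then drop the helper column
dup[res, on = "rn", rank_match := i.rank_match][, rn := NULL]

dup$rank_match  # 20 5 9
```

Grouping by rank here would merge rows 1 and 2 into one group, which is why the row id is needed.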

An alternative implementation (with two variants):

# option 1
dt[melt(dt, id = 1)[, value[frank(-value) == .BY], by = rank]
   , on = .(rank)
   , rank_match := V1 ]

# option 2
dt[, rank_match := melt(dt, id = 1)[, value[frank(-value) == .BY], by = rank]$V1 ]

which both give the desired result:

> dt
   rank          1          2          3           4           5           6 rank_match
1:    3 -0.6264538  0.3295078  0.5757814 -0.62124058 -0.01619026  0.91897737  0.3295078
2:    4  0.1836433 -0.8204684 -0.3053884 -2.21469989  0.94383621  0.78213630 -0.3053884
3:    2 -0.8356286  0.4874291  1.5117812  1.12493092  0.82122120  0.07456498  1.1249309
4:    1  1.5952808  0.7383247  0.3898432 -0.04493361  0.59390132 -1.98935170  1.5952808
