简体   繁体   中英

select best comb among all combination data.table r

Suppose there are a few firm combination result in certain best value, how to efficiently select unique best combination that every firm appear only once the data.table way?

The sample data:

require(data.table)
set.seed(1234)
allcombs <- data.table(val=sample(1:20,15), t(combn(LETTERS[1:6], 2)))
setnames(allcombs, paste0("V",1:2), paste0("firm",1:2))
copy_sets = copy(allcombs)

allcombs
    val firm1 firm2
 1:  16     A     B
 2:   5     A     C
 3:  12     A     D
 4:  15     A     E
 5:   9     A     F
 6:  19     B     C
 7:   6     B     D
 8:   4     B     E
 9:   2     B     F
10:   7     C     D
11:  14     C     E
12:  10     C     F
13:  11     D     E
14:  20     D     F
15:  13     E     F

I can do this with a loop:

all_elements = unique(c(allcombs$firm1, allcombs$firm2))
selected_pairs = data.table()
while (length(all_elements) > 0){
  selected_pairs <- rbind(selected_pairs, allcombs[allcombs[firm1 == all_elements[1] | firm2 == all_elements[1], .I[which.max(val)]]])
  all_elements <- setdiff(all_elements, unlist(allcombs[allcombs[firm1 == all_elements[1] | firm2 == all_elements[1], .I[which.max(val)]],.(firm1,firm2)]))
  allcombs <- allcombs[firm1 %in% all_elements & firm2 %in% all_elements]
}

Here is what I wanted:

selected_pairs
   val firm1 firm2
1:  16     A     B
2:  14     C     E
3:  20     D     F

Any help appreciated!

If I understand correctly, the OP wants to select the unique best combinations where every firm appears only once.

The code below picks the combination with the highest val , marks all other rows which contain firm1 or firm2 as done and iteratively continues with the remaining rows until all rows are done . Bookeeping is done by updating the rank column by reference , ie, without copying.

d <- copy(allcombs)
setorder(d, -val)
d[, rank := NA_integer_]
r = 0L
remain <- d[, .I]
while (length(remain) > 0) {
  r <- r + 1L
  idx <- remain[d[remain, which.max(val)]]
  d[idx, rank := r]
  lut <- d[idx, .(firm = c(firm1, firm2), rank = NA_integer_)]
  d[lut, on = c("firm1==firm", "rank"), rank := 0]
  d[lut, on = c("firm2==firm", "rank"), rank := 0]
  remain <- d[, .I[is.na(rank)]]
}
d[rank > 0]
 val firm1 firm2 rank <int> <char> <char> <int> 1: 20 DF 1 2: 19 B C 2 3: 15 AE 3

Note that the result here differs from OP's result as the data.table is ordered by decreasing val while OP's code iterates over the rows in order in which the company names appear in firm1 and firm2 .

I find this arbitrary and not conclusive. OP's approach will select only suboptima within the combinations of the current firm1 instance but not the overall optimum of all the remaining rows.


Edit 2

Here is a simplified version of above code which uses an additional row id rn column instead of the remain vector:

d <- copy(allcombs)
d[, rank := NA_integer_] # append bookkeeping column
d[, rn := .I] # append row id
r = 0L
while (any(is.na(d$rank))) {
  r <- r + 1L
  idx <- d[is.na(rank), rn[which.max(val)]]
  d[idx, rank := r]
  lut <- d[idx, .(firm = c(firm1, firm2), rank = NA_integer_)]
  d[lut, on = c("firm1==firm", "rank"), rank := 0L]
  d[lut, on = c("firm2==firm", "rank"), rank := 0L]
}
d[rank > 0]

Edit 1

Out of curiosity I have tried to reproduce OP's expected result. So, here is a variant of the code above which loops over the unique company names:

d <- copy(allcombs)
firms <- d[, unique(c(firm1, firm2))]
# firms <- rev(d[, unique(c(firm1, firm2))])
d[, rank := NA_integer_]
d[, rn := .I] # append row id
r = 0L
for (f in firms) {
  r <- r + 1L
  idx <- d[is.na(rank) & (firm1 == f | firm2 == f), rn[which.max(val)]]
  d[idx, rank := r]
  lut <- d[idx, .(firm = c(firm1, firm2), rank = NA_integer_)]
  d[lut, on = c("firm1==firm", "rank"), rank := 0L]
  d[lut, on = c("firm2==firm", "rank"), rank := 0L]
  if (!any(is.na(d$rank))) break
}
d[rank > 0]

     val  firm1  firm2  rank    rn
1:    16      A      B     1     1
2:    14      C      E     3    11
3:    20      D      F     4    14

As mentioned above, the result may depend on the order of appearance of company names in firm1 and firm2 .

This can be demonstrated by reversing the order of company names by

firms <- rev(d[, unique(c(firm1, firm2))])

Now, the code returns

     val  firm1  firm2  rank    rn
1:    15      A      E     2     4
2:    19      B      C     4     6
3:    20      D      F     1    14

The bookkeeping columns have not been removed for demonstration.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM