Rearrange columns based on coverage of previous columns

I'm working on a test coverage analysis and I would like to rearrange a matrix so that the columns are ordered by the number of "additional" test failures.

As an example, I have a matrix of TRUE and FALSE values, where TRUE indicates a failure.

df <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE), .Dim = c(10L, 3L), .Dimnames = list(NULL, c("t1", "t2", "t3")))

t2 has the highest number of failures and should be the first column. t1 has the next highest, but all of its failures (per row) are already covered by t2. t3 has fewer failures, but its last two failing rows are not covered by t2, so it should be the second column.

Desired column order based on fail coverage:

df <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE), .Dim = c(10L, 3L), .Dimnames = list(NULL, c("t2", "t3", "t1")))
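To make the coverage logic concrete, here is a quick tally of the "additional" failures per column in the desired order (a small sketch on the example data):

covered <- rep(FALSE, nrow(df))
for (col in c("t2", "t3", "t1")) {
  cat(col, "adds", sum(df[, col] & !covered), "new failing rows\n")
  covered <- covered | df[, col]  # rows covered so far
}
# t2 adds 8 new failing rows
# t3 adds 2 new failing rows
# t1 adds 0 new failing rows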

I was able to get a count of "additional" fails per test using a for loop in conjunction with the apply function, but performance is really bad when there are many columns and rows in the data set. I would, however, prefer to rearrange the columns themselves for further processing.

out <- df
col.list <- character(0)
val.list <- integer(0)
while (ncol(out) > 0) {
  idx <- which.max(apply(out, 2, sum, na.rm = TRUE))      # column with most remaining failures
  col.list <- c(col.list, names(idx))
  val.list <- c(val.list, sum(out[, idx], na.rm = TRUE))  # its count of "additional" failures
  out[which(out[, idx]), ] <- FALSE                       # mark rows it covers as handled
  out <- out[, -idx, drop = FALSE]                        # drop the selected column
}

Can anyone suggest a better approach to do this? Maybe not using a for loop?

Thanks.

Here's a somewhat similar approach to the OP's, but I hope it will perform slightly better (not tested, though):

select_cols <- names(tail(sort(colSums(df)), 1))  # first col: most failures overall
for (i in seq_len(ncol(df) - 1)) {
  remaining_cols <- setdiff(colnames(df), select_cols)
  # rows already covered by any selected column
  idx <- rowSums(df[, select_cols, drop = FALSE]) > 0
  # among the uncovered rows, pick the column with the most failures
  select_cols <- c(select_cols,
                   names(tail(sort(colSums(df[!idx, remaining_cols, drop = FALSE])), 1)))
}
df <- df[, select_cols]
df

#        t2    t3    t1
# [1,]  TRUE FALSE  TRUE
# [2,]  TRUE FALSE  TRUE
# [3,]  TRUE FALSE  TRUE
# [4,]  TRUE FALSE  TRUE
# [5,]  TRUE FALSE  TRUE
# [6,]  TRUE FALSE  TRUE
# [7,]  TRUE FALSE FALSE
# [8,]  TRUE  TRUE FALSE
# [9,] FALSE  TRUE FALSE
# [10,] FALSE  TRUE FALSE

Update: try this slightly modified version; it is a lot faster and I think it will produce correct results:

# here 'm' is the logical failure matrix ('df' above)
select_cols <- names(tail(sort(colSums(m)), 1))     # first col: most failures overall
idx <- rowSums(m[, select_cols, drop = FALSE]) > 0  # rows covered by the first column
for (i in seq_len(ncol(m) - 1)) {
  remaining_cols <- setdiff(colnames(m), select_cols)
  # only recompute coverage for rows that are not yet covered
  idx[!idx] <- rowSums(m[!idx, select_cols, drop = FALSE]) > 0
  select_cols <- c(select_cols,
                   names(tail(sort(colSums(m[!idx, remaining_cols, drop = FALSE])), 1)))
}
m <- m[, select_cols]
m

The main difference between the two is this line:

idx[!idx] <- rowSums(m[!idx, select_cols, drop=FALSE]) > 0

which means we don't need to compute rowSums for rows where any previously selected column is already true.
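As a tiny illustration of that in-place update pattern (toy vectors, not from the data above):

idx <- c(TRUE, FALSE, FALSE, TRUE)  # rows already covered
newly <- c(TRUE, FALSE)             # coverage computed for the two uncovered rows only
idx[!idx] <- newly                  # write back into the uncovered positions
idx
# [1]  TRUE  TRUE FALSE  TRUE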

Here's my solution, which is based on a shortcut.

df <- as.data.frame(df)
df_new <- df
index <- NULL
for (i in seq_len(ncol(df))) {
  # column with the most failures among the remaining rows
  var <- names(sort(apply(X = df, MARGIN = 2, sum), decreasing = TRUE))[1]
  index <- c(index, var)
  # keep only rows not yet covered, and drop the selected column
  # (dropping it avoids re-selecting the same column on ties)
  df <- df[!df[[var]], names(df) != var, drop = FALSE]
}
df_new[, index]

If we only need the counts of new failures, we can iterate a loop that:

  1. takes the variable with the most failures,
  2. removes the rows where that variable had failures,
  3. picks the next variable with the most failures among the remaining rows.

Step 2 is what makes the loop faster; steps 1 and 3 are based on apply.
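On the example matrix this selects the desired order:

index
# [1] "t2" "t3" "t1"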

Hope it helps!

Here's an alternative working with the data in long format instead. I use data.table functions, but it could be adapted to base R if desired. I hope I understood your logic correctly ;) At least I try to explain my understanding in the commented code.

# convert matrix to data.table
library(data.table)
dt <- as.data.table(df)

# add row index, 'ri'
dt[ , ri := 1:.N]

# melt to long format
d <- melt(dt, id.vars = "ri", variable.factor = FALSE, variable.name = "ci")

# determine first column
# for each 'ci' (columns in 'df'), count number of TRUE
# select 'ci' with max count
first_col <- d[ , sum(value), by = ci][which.max(V1), ci]

# for each 'ri' (rows in 'df'),
# check if number of unique 'ci' is one (i.e. "additional" test failures)    
d[(value), new := uniqueN(ci) == 1, by = ri]

# select rows where 'new' is TRUE
# for each 'ci', count the number of rows, i.e the number of 'new'
# -> number of rows in 'df' where this column is the only TRUE
d_new <- d[(new), .(n_new = .N), ci]

# set order to descending 'n_new'
setorder(d_new, -n_new)

# combine first column and columns which contribute with additional TRUE
cols <- c(first_col, setdiff(d_new[ , ci], first_col)) 

# set column order. 
# First 'cols', then any columns which haven't contributed with new values
# (none in the test data, but needed for more general cases)  
setcolorder(dt, c(cols, setdiff(names(dt), cols)))

dt
#        t2    t3    t1 ri
#  1:  TRUE FALSE  TRUE  1
#  2:  TRUE FALSE  TRUE  2
#  3:  TRUE FALSE  TRUE  3
#  4:  TRUE FALSE  TRUE  4
#  5:  TRUE FALSE  TRUE  5
#  6:  TRUE FALSE  TRUE  6
#  7:  TRUE FALSE FALSE  7
#  8:  TRUE  TRUE FALSE  8
#  9: FALSE  TRUE FALSE  9
# 10: FALSE  TRUE FALSE 10

Tried it on a matrix of the size mentioned in the comments:

set.seed(1)
nr <- 14000
nc <- 1400
df <- matrix(sample(c(TRUE, FALSE), nr*nc, replace = TRUE), nr, nc,
             dimnames = list(NULL, paste0("t", 1:nc)))
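To reproduce the timing, system.time() is a simple option. Here is a sketch with the updated loop from the earlier answer wrapped in a hypothetical reorder_cols(); substitute whichever solution you want to measure:

reorder_cols <- function(m) {
  select_cols <- names(tail(sort(colSums(m)), 1))
  idx <- rowSums(m[, select_cols, drop = FALSE]) > 0
  for (i in seq_len(ncol(m) - 1)) {
    remaining_cols <- setdiff(colnames(m), select_cols)
    idx[!idx] <- rowSums(m[!idx, select_cols, drop = FALSE]) > 0
    select_cols <- c(select_cols,
                     names(tail(sort(colSums(m[!idx, remaining_cols, drop = FALSE])), 1)))
  }
  m[, select_cols]
}
system.time(res <- reorder_cols(df))  # elapsed time for the 14000 x 1400 matrix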

Finished in < 5 seconds.
