
Create a column from a dynamic number of source columns depending on availability in R

Given an uncertain number of columns containing source values for the same variable, I would like to create a column that holds the final value, selected according to source priority and availability.

Reproducible data:

  set.seed(123)
  actuals = runif(10, 500, 1000)
  get_rand_vector <- function(){return (runif(10, 0.95, 1.05))}
  get_na_rand_ixs <- function(){return (round(runif(5,0,10),0))}
  df = data.frame("source_1" = actuals*get_rand_vector(),
                  "source_2" = actuals*get_rand_vector(),
                  "source_n" = actuals*get_rand_vector())
  df[["source_1"]][get_na_rand_ixs()] <- NA
  df[["source_2"]][get_na_rand_ixs()] <- NA
  df[["source_n"]][get_na_rand_ixs()] <- NA

My manual solution is as follows:

  df$available <- ifelse(
    !is.na(df$source_1),
    df$source_1,
    ifelse(
      !is.na(df$source_2),
      df$source_2,
      df$source_n
    )
  )

This gives the desired result:

   source_1 source_2 source_n available
1        NA       NA       NA        NA
2        NA       NA 930.1242  930.1242
3  716.9981       NA 717.9234  716.9981
4        NA 988.0446       NA  988.0446
5  931.7081       NA 924.1101  931.7081
6  543.6802 533.6798       NA  543.6802
7  744.6525 767.4196 783.8004  744.6525
8  902.8788 955.1173       NA  902.8788
9  762.3690       NA 761.6135  762.3690
10 761.4092 702.6064 708.7615  761.4092

How can I automatically iterate over the available sources to pick the value to use? The number of sources varies (n_sources could be 1, 2, 3, ..., 7), and priority follows the natural order (1 > 2 > ...).

Once you have all of the candidate vectors in order and in an appropriate data structure (e.g., a data.frame or matrix), you can use apply to apply a function over the rows. In this case, we just look for the first non-NA value. Thus, after the first block of code above, you only need the following line:

df$available <- apply(df, 1, FUN = function(x) x[which(!is.na(x))[1]])
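Note that apply(df, 1, ...) looks at every column, so it assumes the data frame holds nothing but the sources. If other columns are present, one option is to pick the source columns by name first and then fold over them. The following is only a sketch in base R: the toy data, the id column, the "^source_" name pattern and the helper first_non_na are assumptions for illustration, not part of the question.

```r
# Hypothetical toy data: an extra "id" column plus three prioritised sources.
df <- data.frame(
  id       = 1:4,
  source_1 = c(NA, 10, NA, NA),
  source_2 = c(20, 11, NA, NA),
  source_3 = c(30, NA, 32, NA)
)

# Select the source columns by name (assumed pattern "^source_").
# Their order in the data frame defines the priority.
src_cols <- grep("^source_", names(df), value = TRUE)

# Hypothetical helper: keep the current value unless it is NA,
# in which case fall back to the next source.
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

# Fold over the columns from left to right; this works column-wise
# (vectorised) instead of row by row.
df$available <- Reduce(first_non_na, df[src_cols])

print(df)
```

Because the fold only depends on src_cols, the same code should work unchanged for 1, 2, ..., 7 sources, provided the columns appear in priority order.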

coalesce() from dplyr is designed for this:

library(dplyr)

df %>%
  mutate(available = coalesce(!!!.))

   source_1 source_2 source_n available
1        NA       NA       NA        NA
2        NA       NA 930.1242  930.1242
3  716.9981       NA 717.9234  716.9981
4        NA 988.0446       NA  988.0446
5  931.7081       NA 924.1101  931.7081
6  543.6802 533.6798       NA  543.6802
7  744.6525 767.4196 783.8004  744.6525
8  902.8788 955.1173       NA  902.8788
9  762.3690       NA 761.6135  762.3690
10 761.4092 702.6064 708.7615  761.4092
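The coalesce(!!!.) call splices every column of the data frame into coalesce(), so it likewise assumes that only source columns are present. A possible variation, sketched under the assumption that the sources share a "source_" prefix and that an unrelated id column exists (both hypothetical here), restricts the splice with select() and starts_with():

```r
library(dplyr)

# Hypothetical toy data with a non-source column.
df2 <- data.frame(
  id       = 1:3,
  source_1 = c(NA, 2, NA),
  source_2 = c(5, 6, NA)
)

# Splice only the columns whose names start with "source_" into coalesce();
# column order again determines priority.
result <- df2 %>%
  mutate(available = coalesce(!!!select(., starts_with("source_"))))

print(result)
```

This relies on the magrittr pipe (%>%) for the `.` placeholder, as in the answer above.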
