
Assign new column values based on multiple conditions in R

I need to assign a new column with multiple possible values, based on multiple conditions. Example data:

a1 a2 a3 a4 a5 a6 a7 a8 a9 
NA 1  NA 2  7  8  9  1  1 
7  7  7  7  7  7  7  7  7
6  6  6  6  6  6  5  5  5

So I might have rules, for example: if a1 to a9 contain 1 or 2, then return 1, otherwise return 7. Or: if a1 to a9 contain 5 or 6, return 6, otherwise 3. I have a number of these rules, so I need something that can accommodate them all.

Required outcome

a1 a2 a3 a4 a5 a6 a7 a8 a9 NEW
NA 1  NA 2  7  8  9  1  1  1
7  7  7  7  7  7  7  7  7  7
6  6  6  6  6  6  5  5  5  6

I have tried assigning with subsetting, i.e.:

df$NEW <- 7
df$NEW[df$a1==1 | df$a2==1 | df$a3==1] <- 1
df$NEW[df$a4==1 | df$a5==1 | df$a6==1] <- 1
df$NEW[df$a7==1 | df$a8==1 | df$a9==1] <- 1
df$NEW[df$a1==7 | df$a2==7 | df$a3==7] <- 7
df$NEW[df$a1==5 | df$a2==5 | df$a3==5] <- 6
df$NEW[df$a1==6 | df$a2==6 | df$a3==6] <- 6

I'm aware this is clunky, but it works up to a point. Once there are multiple values/conditions, however, not all values are filled in correctly (it returns maybe 2 out of 3+ desired/assigned values). For the 'otherwise' rule I have used != as well as > or <. I've also tried ifelse, with the same result.

I'm also aware the solution is probably relatively simple and staring me in the face, but I'd be grateful if you could point me to a reasonable solution.

If there's anything you want me to clarify, just let me know.

Thanks in advance.

dplyr has a vectorised if statement called case_when that can help you:

library(dplyr)

df <- read.table(text = 'a1 a2 a3 a4 a5 a6 a7 a8 a9 
           NA 1  NA 2  7  8  9  1  1 
           7  7  7  7  7  7  7  7  7
           6  6  6  6  6  6  5  5  5', header = TRUE)

df %>% 
  mutate(
    NEW = case_when(
      a1 == 1 | a2 == 1 | a3 == 1 ~ 1,
      a4==1 | a5==1 | a6==1 ~ 1,
      a7==1 | a8==1 | a9==1 ~ 1,
      a1==7 | a2==7 | a3==7 ~ 7,
      a1==5 | a2==5 | a3==5 ~ 6,
      a1==6 | a2==6 | a3==6 ~ 6
    )
  )

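The conditions go on the left-hand side of ~ and the result you want on the right-hand side. Conditions are evaluated in order and the first one that is TRUE wins; rows that match no condition get NA.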

Returns:

  a1 a2 a3 a4 a5 a6 a7 a8 a9 NEW
1 NA  1 NA  2  7  8  9  1  1   1
2  7  7  7  7  7  7  7  7  7   7
3  6  6  6  6  6  6  5  5  5   6
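
Since the rules in the question apply to all of a1 to a9, the conditions can probably be written without listing every column. The following is only a sketch, assuming dplyr 1.0.4 or later (needed for if_any()). The fallback value 7 and the priority of the "1 or 2" rule over the "5 or 6" rule are assumptions taken from the question's first example, not something the question states outright.

```r
library(dplyr)

df <- read.table(text = "a1 a2 a3 a4 a5 a6 a7 a8 a9
NA 1 NA 2 7 8 9 1 1
7 7 7 7 7 7 7 7 7
6 6 6 6 6 6 5 5 5", header = TRUE)

res <- df %>%
  mutate(
    NEW = case_when(
      # %in% returns FALSE (not NA) for missing values,
      # so rows containing NA are still classified
      if_any(a1:a9, ~ .x %in% c(1, 2)) ~ 1,
      if_any(a1:a9, ~ .x %in% c(5, 6)) ~ 6,
      # assumed fallback when no rule matches
      TRUE ~ 7
    )
  )

res
```

Adding another rule means adding one more line above the fallback; its position decides its priority.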

Here's an idea that works with multiple rules. But your example is not clear: what happens in a row without 1, 2, 5 or 6? Is it 7 or 3?

Anyway, here is an adaptable idea based on: 1 or 2 -> 1; 5 or 6 -> 6 (assuming 1 or 2 and 5 or 6 cannot be mixed); otherwise -> 7

df$new <- 7  # default when no rule matches

for (i in seq_len(nrow(df))) {
  # values of row i, excluding the new column itself
  vals <- unlist(df[i, names(df) != "new"])
  if (1 %in% vals || 2 %in% vals) {
    df$new[i] <- 1
  } else if (5 %in% vals || 6 %in% vals) {
    df$new[i] <- 6
  }
}


df

You could use the apply function instead of the loop.
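
As a rough sketch of what the apply version might look like, under the same assumed rules (1 or 2 -> 1; 5 or 6 -> 6; otherwise -> 7). The helper name classify_row and the vector cols are made up for this illustration.

```r
df <- read.table(text = "a1 a2 a3 a4 a5 a6 a7 a8 a9
NA 1 NA 2 7 8 9 1 1
7 7 7 7 7 7 7 7 7
6 6 6 6 6 6 5 5 5", header = TRUE)

# columns the rules apply to (hypothetical helper vector)
cols <- paste0("a", 1:9)

# classify one row; x is the numeric vector of that row's values
classify_row <- function(x) {
  if (any(x %in% c(1, 2))) {
    1
  } else if (any(x %in% c(5, 6))) {
    6
  } else {
    7  # assumed fallback
  }
}

# MARGIN = 1 calls classify_row once per row
df$new <- apply(df[cols], 1, classify_row)
df
```

Because %in% never returns NA, the NA entries in the first row do not need special handling.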

Here you go... everything should be well explained in this (base R) loop. You would only need to spend some time creating a coefficients file to generalize it to other data. You would also have to tweak it a bit when your conditions change (& instead of |, < instead of ==, etc.).

df <- data.frame(matrix(c(NA, 1, NA, 2, 7, 8, 9, 1, 1,
                          7, 7, 7, 7, 7, 7, 7, 7, 7,
                          6, 6, 6, 6, 6, 6, 5, 5, 5),
                        nrow = 3, ncol = 9, byrow = TRUE))
colnames(df) <- c("a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9")
nbconditions <- 6
condition <- matrix(NA, nrow=nrow(df) , ncol= nbconditions)
# you could read.xlsx an already prepared coefficient matrix here
coefficients <-  matrix(NA, nrow= ncol(df)  , ncol=nbconditions )
coefficients[c(1,2,3),1] <- 1
coefficients[c(4,5,6),2] <- 1
coefficients[c(7,8,9),3] <- 1
coefficients[c(1,2,3),4] <- 7
coefficients[c(1,2,3),5] <- 5
coefficients[c(1,2,3),6] <- 6
results <- c(1,1,1,7,6,6)
NEW <- rep(NA, nrow(df))

for (i in seq_len(nrow(df))) {
  found <- FALSE
  # Check rules from last to first: the last rule has the highest priority,
  # like sequential assignments where later rules overwrite earlier ones
  for (j in nbconditions:1) {
    if (!found) {
      indicestocheck <- which(!is.na(coefficients[, j]))
      if (all(is.na(df[i, indicestocheck]))) {
        NEW[i] <- NA
      } else {
        # compare only the columns this rule covers with the rule's values
        checks <- coefficients[indicestocheck, j] == unlist(df[i, indicestocheck])
        #print(checks)
        if (any(checks, na.rm = TRUE)) {
          NEW[i] <- results[j]
          found <- TRUE
          print(paste(j, "found", results[j]))
        }
      }
    }
  }
}
df$NEW <- NEW
df

> df
  a1 a2 a3 a4 a5 a6 a7 a8 a9 NEW
1 NA  1 NA  2  7  8  9  1  1   1
2  7  7  7  7  7  7  7  7  7   7
3  6  6  6  6  6  6  5  5  5   6
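
If the rules are of the "any of these columns contains one of these values" kind described in the question, the rule table could perhaps be kept as a plain list instead of a coefficient matrix. This is a minimal sketch under that assumption; the names rules, cols, values, result and default are hypothetical, and the fallback 7 comes from the question's first example.

```r
df <- read.table(text = "a1 a2 a3 a4 a5 a6 a7 a8 a9
NA 1 NA 2 7 8 9 1 1
7 7 7 7 7 7 7 7 7
6 6 6 6 6 6 5 5 5", header = TRUE)

# hypothetical rule table: earlier entries take priority over later ones
rules <- list(
  list(cols = paste0("a", 1:9), values = c(1, 2), result = 1),
  list(cols = paste0("a", 1:9), values = c(5, 6), result = 6)
)
default <- 7  # assumed value when no rule matches

NEW <- rep(default, nrow(df))

# apply rules from lowest to highest priority so that earlier rules overwrite
for (rule in rev(rules)) {
  block <- as.matrix(df[rule$cols])
  # %in% drops the matrix shape, so restore it before counting hits per row
  hit <- rowSums(matrix(block %in% rule$values, nrow = nrow(block))) > 0
  NEW[hit] <- rule$result
}

df$NEW <- NEW
df
```

Each rule is handled for all rows at once, so only the loop over rules remains.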

