Creating a new column based on values from other columns, using mutate() and case_when() in R

I am a student relatively new to R and have learnt a lot from browsing here. I have been stuck on something recently and, after hours of trying, still haven't been able to figure out what to do. Consider the following data set:

ID Y1 Y2 Y3 Y4
1  0  0  1  1
2  0  0  0  0
3  NA NA NA NA

I want to create a new column that is filled according to the following conditions:

  1. If the row contains 1, return 1 regardless of NA or 0
  2. If it contains a mix of 0 and NA but not 1, return 0
  3. If it only contains NA, return NA

So using the example above I wanted to get the following:

ID Y1 Y2 Y3 Y4 Outcome
1  0  0  1  1  1
2  0  0  0  0  0
3  NA NA NA NA NA

However, the code I tried:

Data2 <- Data %>% mutate(Outcome = case_when( 
                                Data$Y1 == "na" &
                                Data$Y2 == "na" &
                                Data$Y3 == "na" &
                                Data$Y4 == "na" ~ "na"))  %>%                                
          mutate(Outcome = case_when(Data$Y1 == 1 ~ "1", 
                                 Data$Y2 == 1 ~ "1", 
                                 Data$Y3 == 1 ~ "1",
                                 Data$Y4 == 1 ~ "1",
                                 TRUE ~ "No"))

returns:

ID Y1 Y2 Y3 Y4 Outcome
1  0  0  1  1  1
2  0  0  0  0  0
3  NA NA NA NA 0

which seems to ignore condition 3: if the row only contains NA, return NA.

Any pointers as to what I did wrong would be greatly appreciated.

Please forgive the formatting; I'm not sure how to make it prettier, as this is the first time I have asked a question here.

Many thanks in advance!

[Edit] Thanks to Shah, I noticed that there is potential for confusion, for which I apologise. To clarify: this is just a segment of the data set to get the point across. I'm dealing with a big data set that contains more columns, some of which also hold numeric values.

Checking each column (Y1, Y2, Y3, etc.) individually is tedious and does not scale. It becomes a real problem if you have 100 columns that need this treatment.

As shown in the example, you want to ignore the first column (ID) and include all other columns in the calculation, so you can do the following. The -1 in df[-1] drops the first column, ID.

Also, use is.na() to test for NA values.
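
To see why the comparisons against "na" in the question never match, here is a small sketch on a made-up vector x (hypothetical, not taken from the original data). Comparing with == against a string or against NA never yields TRUE for a missing value, whereas is.na() does. Since case_when() treats a condition that evaluates to NA as no match, such rows fall through to the TRUE ~ ... default.

```r
# Hypothetical vector, only for illustration
x <- c(0, 1, NA)

eq_string  <- x == "na"  # "na" is an ordinary string, not a missing value
eq_na      <- x == NA    # any comparison with NA is itself NA
is_missing <- is.na(x)   # the reliable test for missing values

print(eq_string)   # FALSE FALSE    NA
print(eq_na)       #    NA    NA    NA
print(is_missing)  # FALSE FALSE  TRUE
```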

# Count the non-NA values per row; used below to set Outcome to NA
# for rows in which every value is NA
non_NA <- rowSums(!is.na(df[-1]))
# Assign 1 if the row sum is greater than 0, i.e. the row contains at least one 1
df$Outcome <- as.integer(rowSums(df[-1], na.rm = TRUE) > 0)
# Set Outcome to NA for rows in which all values are NA
df$Outcome[non_NA == 0] <- NA
df
#  ID Y1 Y2 Y3 Y4 Outcome
#1  1  0  0  1  1       1
#2  2  0  0  0  0       0
#3  3 NA NA NA NA      NA

data

df <- structure(list(ID = 1:3, Y1 = c(0L, 0L, NA), Y2 = c(0L, 0L, NA
), Y3 = c(1L, 0L, NA), Y4 = c(1L, 0L, NA)), 
class = "data.frame", row.names = c(NA, -3L))
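
The question's edit mentions that the real data set has further columns, some of them numeric, so summing everything except ID could pick up unrelated values. The following is a minimal sketch of one way to restrict the calculation, assuming the relevant columns are exactly those named Y followed by digits. The extra column Score and the fourth row (a mix of 0 and NA) are hypothetical additions for illustration only.

```r
# Example data from above, plus a hypothetical numeric column `Score`
# and a hypothetical fourth row that mixes 0 and NA
df <- data.frame(
  ID    = 1:4,
  Y1    = c(0L, 0L, NA, NA),
  Y2    = c(0L, 0L, NA, 0L),
  Y3    = c(1L, 0L, NA, NA),
  Y4    = c(1L, 0L, NA, 0L),
  Score = c(3.5, 7.2, 1.0, 4.4)
)

# Assumption: the columns of interest are exactly those named Y<digits>
y_cols <- grep("^Y[0-9]+$", names(df), value = TRUE)

# TRUE where at least one of the selected columns equals 1
has_one <- rowSums(df[y_cols] == 1, na.rm = TRUE) > 0
# TRUE where every selected column is NA
all_na <- rowSums(!is.na(df[y_cols])) == 0

df$Outcome <- as.integer(has_one)
df$Outcome[all_na] <- NA
print(df)
```

Counting the cells that equal 1, rather than summing the values, keeps the result correct even if a selected column holds values other than 0 and 1.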

You can try dplyr's rowwise() function, which treats each row separately:

library(dplyr)

df |>
  rowwise() |>
  mutate(Outcome = case_when(any(c_across(Y1:Y4) == 1) ~ "1",
                             all(is.na(c_across(Y1:Y4))) ~ NA_character_,
                             TRUE ~ "0"))

  • output
# A tibble: 3 × 6
# Rowwise: 
     ID    Y1    Y2    Y3    Y4 Outcome
  <int> <int> <int> <int> <int> <chr>  
1     1     0     0     1     1 1      
2     2     0     0     0     0 0      
3     3    NA    NA    NA    NA NA     
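
As an alternative sketch that avoids rowwise(), dplyr's if_any() and if_all() (available in dplyr 1.0.4 and later) can express the same three conditions across columns. This is not part of the answer above. It assumes the columns of interest are Y1 to Y4 and, unlike the version above, it returns an integer column rather than a character one.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  ID = 1:3,
  Y1 = c(0L, 0L, NA),
  Y2 = c(0L, 0L, NA),
  Y3 = c(1L, 0L, NA),
  Y4 = c(1L, 0L, NA)
)

result <- df |>
  mutate(Outcome = case_when(
    if_any(Y1:Y4, ~ .x %in% 1) ~ 1L,           # at least one 1 in the row
    if_all(Y1:Y4, is.na)       ~ NA_integer_,  # every value is missing
    TRUE                       ~ 0L            # only 0s, possibly mixed with NA
  ))
print(result)
```

Using %in% instead of == makes the first condition return FALSE rather than NA for missing values, so the order of the conditions does not depend on how case_when() handles NA.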
