Add a column based on the values of two other columns in the same data frame in R

Suppose I have a data frame with three variables like the one below. I want to add a fourth variable whose values depend on the second and third variables, e.g. if var2 = var3 then var4 = 3; if var2 = Y and var3 = NA then var4 = 1; and if var2 = NA and var3 = Y then var4 = 2.

var1 var2 var3
m01  Y    NA    
m02  Y    NA
m03  NA   Y
m04  NA   Y
m05  Y    Y
m06  Y    NA
m07  Y    Y

I would like to get a data frame like this:

var1 var2 var3 var4
m01  Y    NA   1
m02  Y    NA   1
m03  NA   Y    2
m04  NA   Y    2
m05  Y    Y    3
m06  Y    NA   1
m07  Y    Y    3

I am trying with ifelse but I haven't had success.

Any ideas?

Everyone forgets about poor old interaction:

c(3,2,1,4)[interaction(lapply(dat[-1], is.na))]
#[1] 1 1 2 2 3 1 3
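
To see why the lookup vector is ordered c(3,2,1,4), it can help to print the levels that interaction builds. The following is a minimal sketch, assuming dat is the question's data frame (recreated here with read.table) and that each column contains at least one NA and one non-NA value, so that all four combinations exist as levels:

```r
# Recreate the question's data; `dat` is the name the answer above assumes.
dat <- read.table(text = "var1 var2 var3
m01 Y NA
m02 Y NA
m03 NA Y
m04 NA Y
m05 Y Y
m06 Y NA
m07 Y Y", header = TRUE, stringsAsFactors = FALSE)

# One factor level per combination of is.na(var2) and is.na(var3);
# by default the first factor varies fastest.
combo <- interaction(lapply(dat[-1], is.na))
levels(combo)
#> [1] "FALSE.FALSE" "TRUE.FALSE"  "FALSE.TRUE"  "TRUE.TRUE"

# Indexing by a factor uses its integer codes, so element i of the
# lookup vector is the value assigned to level i.
lookup <- c(3, 2, 1, 4)
var4 <- lookup[combo]
var4
#> [1] 1 1 2 2 3 1 3
```

The fourth lookup value (4) would only be used for rows where both columns are NA, which do not occur in the sample data.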

Try this:

library(dplyr)
df <- data.frame(var1 = paste0("m0",1:7), 
             var2 = c(rep("Y",2) ,rep(NA, 2), rep("Y", 3)),
             var3 = c(rep(NA, 2), rep("Y", 3), NA, "Y"))
mutate(df, var4 = if_else(var2 ==  "Y", 
                      if_else(var3 == "Y", 3, 1,1), 
                      2, 2))

if_else from the dplyr package takes a fourth argument, missing, which supplies the value to use where the condition is NA, so the missing values are handled as well.

A handful of options:

df <- read.table(text = 'var1 var2 var3
m01  Y    NA    
m02  Y    NA
m03  NA   Y
m04  NA   Y
m05  Y    Y
m06  Y    NA
m07  Y    Y', header = TRUE, stringsAsFactors = FALSE)

A typical base R approach is to use apply to iterate row-wise over the relevant columns. apply silently coerces the data frame to a matrix, which is why some avoid this approach.

apply(df[-1], 1, function(x){sum(which(x == 'Y'))})
#> [1] 1 1 2 2 3 1 3

You could translate it to dplyr with rowwise, which does not coerce to a matrix, but is not usually the fastest possible approach:

library(dplyr)

df %>% 
    rowwise() %>% 
    mutate(var4 = sum(which(c(var2, var3) == 'Y')))
#> Source: local data frame [7 x 4]
#> Groups: <by row>
#> 
#> # A tibble: 7 x 4
#>    var1  var2  var3  var4
#>   <chr> <chr> <chr> <int>
#> 1   m01     Y  <NA>     1
#> 2   m02     Y  <NA>     1
#> 3   m03  <NA>     Y     2
#> 4   m04  <NA>     Y     2
#> 5   m05     Y     Y     3
#> 6   m06     Y  <NA>     1
#> 7   m07     Y     Y     3

This will also fail as-is for factors (which c converts to their integer codes in R versions before 4.1.0), but they can be coerced to character beforehand or inside the call, or you could use is.na instead of checking equality.
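
As a rough sketch of the is.na route: since var4 is 1 when only var2 is present, 2 when only var3 is present, and 3 when both are, it equals a weighted sum of the two "not missing" indicators. This is an alternative formulation, not code from any of the answers, and it assumes the only non-missing value in either column is Y. It behaves the same for character and factor columns; factors are used here to illustrate that.

```r
# Same sample data, read with factor columns on purpose.
df <- read.table(text = "var1 var2 var3
m01 Y NA
m02 Y NA
m03 NA Y
m04 NA Y
m05 Y Y
m06 Y NA
m07 Y Y", header = TRUE, stringsAsFactors = TRUE)

# The parentheses matter: `!` binds more loosely than `*` and `+` in R.
# Present var2 contributes 1, present var3 contributes 2.
df$var4 <- 1L * (!is.na(df$var2)) + 2L * (!is.na(df$var3))
df$var4
#> [1] 1 1 2 2 3 1 3
```

Rows where both columns are NA would get 0 under this scheme, which makes them easy to spot.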

More creative base options include pasting the columns together to create a factor that can be deliberately leveled for coercion to integer:

as.integer(factor(paste0(df$var2, df$var3), levels = c('YNA', 'NAY', 'YY')))
#> [1] 1 1 2 2 3 1 3

or using do.call to pass a list of a function and each desired variable of df (flattened with c) to mapply:

do.call(mapply, 
        c(function(...){sum(which(!is.na(c(...))))}, 
          df[-1], 
          USE.NAMES = FALSE))
#> [1] 1 1 2 2 3 1 3

If you really want the ifelse logic, dplyr::case_when lets you use cascading conditionals without the messy syntax:

df %>% mutate(var4 = case_when(var2 == 'Y' & var3 == 'Y' ~ 3,
                               var2 == 'Y' ~ 1, 
                               var3 == 'Y' ~ 2))
#>   var1 var2 var3 var4
#> 1  m01    Y <NA>    1
#> 2  m02    Y <NA>    1
#> 3  m03 <NA>    Y    2
#> 4  m04 <NA>    Y    2
#> 5  m05    Y    Y    3
#> 6  m06    Y <NA>    1
#> 7  m07    Y    Y    3

Using ifelse:

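df$var4 <- ifelse(df$var2 == df$var3, 3, 
             ifelse(df$var3 == "NA" & df$var2 == "Y", 1, 
               ifelse(df$var2 == "NA" & df$var3 == "Y", 2, "?")))

This works if "NA" is a literal string or factor level rather than a true missing value. Note that the "?" fallback can turn the whole result into a character vector; use NA instead to keep var4 numeric. With real NA values, replace df$var3 == "NA" with is.na(df$var3) and df$var2 == "NA" with is.na(df$var2), and move the df$var2 == df$var3 test to the innermost ifelse, because that comparison returns NA whenever either side is missing.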
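
For data with true NA values, as read.table produces for the sample in the question, one possible arrangement of the nested ifelse is sketched below. It checks the is.na cases first and the equality last, since var2 == var3 evaluates to NA when either side is missing. The NA fallback for rows where both columns are missing is an assumption; the question does not cover that case.

```r
df <- read.table(text = "var1 var2 var3
m01 Y NA
m02 Y NA
m03 NA Y
m04 NA Y
m05 Y Y
m06 Y NA
m07 Y Y", header = TRUE, stringsAsFactors = FALSE)

df$var4 <- ifelse(is.na(df$var2) & !is.na(df$var3), 2,      # only var3 present
             ifelse(!is.na(df$var2) & is.na(df$var3), 1,    # only var2 present
               ifelse(df$var2 == df$var3, 3, NA_real_)))    # both present and equal
df$var4
#> [1] 1 1 2 2 3 1 3
```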
