
How to use apply or dplyr to conditionally transform specific variables

I'm working with 3 data frames with similar structure but different values. I'd like to transform/mutate specific variables based on a condition in the variable to be transformed AND a second variable in the data set. Other variables in the data set should be left intact.

In my example data, I'd like to transform columns VAR1-3 to NA IF the corresponding AGE < 65 AND the column itself has value 0.

foo <- data.frame('AGE'=c(50,65,66,40,70,25,65,67,44,56), 'SMOKING'=c(0,0,0,0,0,1,1,1,1,1),
              'VAR1'=c(1,0,0,1,0,1,0,1,0,0),'VAR2'=c(0,0,1,0,0,1,0,0,0,1),'VAR3'=c(1,0,1,1,1,0,0,0,1,0))

VARv <- c('VAR1','VAR2','VAR3')
OTHERSv <- c('SMOKING')
AGEVARv <- c('AGE', VARv)

As my data sets are large (>2000 variables) and the variables may be in different order, I want to use variable names saved in vectors.

I can do this with the following for loop, but I would like to learn how to do it with either dplyr or the apply functions.

for (v in VARv) {
  foo[[v]] <- replace(foo[[v]], foo[[v]] == 0 & foo$AGE < 65, NA)
}

If I didn't have the binary variable SMOKING in the data set, I could do

foo <- apply(foo, 2,function(y) {
foo[foo==0 & foo$AGE < 65] <- NA
return(foo)
}) 

But this would also transform the SMOKING variable.

Question: How do I select and refer to variables in an apply function when I want to refer to one of them by name and process the others automatically?

I have something like this in mind, but how do I refer to the variable AGE correctly? This attempt produces 21 columns of data with the correct NA handling, but it repeats all columns for each of the columns (AGE.SMOKING, AGE.AGE, AGE.VAR1..., VAR1.SMOKING, VAR1.AGE, VAR1.VAR1, etc.).

b <- data.frame(foo[colnames(foo) %in% OTHERSv], apply(foo[colnames(foo) %in% AGEVARv],2,function(y) {
foo[foo==0 & foo$AGE < 65] <- NA
return(foo)
}))

I'd appreciate any insight!

We can wrap the transformation in a function for reuse:

library(dplyr)
f1 <- function(dat, varCols, AgeCol){
  Age <- rlang::sym(AgeCol)
  dat %>%
     mutate_at(vars(varCols), funs(replace(., .==0 & (!!Age) < 65, NA)))
}

AgeC <- 'AGE'

f1(foo, VARv, AgeC)
#   AGE SMOKING VAR1 VAR2 VAR3
#1   50       0    1   NA    1
#2   65       0    0    0    0
#3   66       0    0    1    1
#4   40       0    1   NA    1
#5   70       0    0    0    1
#6   25       1    1    1   NA
#7   65       1    0    0    0
#8   67       1    1    0    0
#9   44       1   NA   NA    1
#10  56       1   NA    1   NA
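mutate_at() and funs() have since been superseded and deprecated in dplyr, so the same idea can be sketched with across(), assuming dplyr 1.0.0 or later is installed. The function name f1_across is hypothetical and not part of the original answer; all_of() selects the columns named in the character vector and .data[[AgeCol]] looks up the age column by its name.

```r
library(dplyr)

# Example data and variable-name vector from the question
foo <- data.frame(AGE = c(50, 65, 66, 40, 70, 25, 65, 67, 44, 56),
                  SMOKING = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
                  VAR1 = c(1, 0, 0, 1, 0, 1, 0, 1, 0, 0),
                  VAR2 = c(0, 0, 1, 0, 0, 1, 0, 0, 0, 1),
                  VAR3 = c(1, 0, 1, 1, 1, 0, 0, 0, 1, 0))
VARv <- c("VAR1", "VAR2", "VAR3")

# f1_across is a hypothetical name for an across()-based variant of f1
f1_across <- function(dat, varCols, AgeCol) {
  dat %>%
    mutate(across(all_of(varCols),
                  # set a value to NA when it is 0 and the age is below 65
                  ~ replace(.x, .x == 0 & .data[[AgeCol]] < 65, NA)))
}

res <- f1_across(foo, VARv, "AGE")
```

Columns not listed in varCols (here AGE and SMOKING) are returned unchanged, regardless of their position in the data frame.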

We can also use base R methods:

f2 <- function(dat, varCols, AgeCol){
  # NA^TRUE is NA and NA^FALSE is 1, so the product blanks out only the flagged cells
  dat[varCols] <- (NA^(dat[[AgeCol]] < 65 & !dat[varCols])) * dat[varCols]
  dat
}

all.equal(f1(foo, VARv, AgeC), f2(foo, VARv, AgeC), check.attributes = FALSE)
#[1] TRUE
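The NA^ expression is compact but easy to misread, so here is a small illustration of how it behaves, followed by a more explicit lapply() variant for readers who asked about the apply family. This is only a sketch: the name f3 and the helper variable young are hypothetical and do not appear in the original answer.

```r
# NA raised to a logical: TRUE is treated as 1 and FALSE as 0,
# and in R anything raised to the power 0 is 1
mask <- NA^c(TRUE, FALSE)   # first element is NA, second is 1

# Example data and variable-name vector from the question
foo <- data.frame(AGE = c(50, 65, 66, 40, 70, 25, 65, 67, 44, 56),
                  SMOKING = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
                  VAR1 = c(1, 0, 0, 1, 0, 1, 0, 1, 0, 0),
                  VAR2 = c(0, 0, 1, 0, 0, 1, 0, 0, 0, 1),
                  VAR3 = c(1, 0, 1, 1, 1, 0, 0, 0, 1, 0))
VARv <- c("VAR1", "VAR2", "VAR3")

# f3 is a hypothetical name: loop over the selected columns with lapply()
# and reuse one precomputed age condition for every column
f3 <- function(dat, varCols, AgeCol) {
  young <- dat[[AgeCol]] < 65
  dat[varCols] <- lapply(dat[varCols],
                         function(x) replace(x, x == 0 & young, NA))
  dat
}

res <- f3(foo, VARv, "AGE")
```

Because lapply() runs over dat[varCols] only, the age column is referred to by name once, outside the loop, and the remaining columns are never touched.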

You might consider using case_when(). Quoting the question:

"In my example data, I'd like to transform columns VAR1-3 to NA IF the corresponding AGE < 65 AND if the column itself has value 0."

Here's an example of case_when() solving this problem:

library(tidyverse)

# NA_real_ gives a true missing value and keeps the columns numeric;
# the string "NA" would turn them into character columns
foo %>% 
  as_tibble() %>% 
  mutate(VAR1 = case_when(AGE < 65 & VAR1 == 0 ~ NA_real_,
                          TRUE ~ VAR1),
         VAR2 = case_when(AGE < 65 & VAR2 == 0 ~ NA_real_,
                          TRUE ~ VAR2),
         VAR3 = case_when(AGE < 65 & VAR3 == 0 ~ NA_real_,
                          TRUE ~ VAR3))

Which returns:

# A tibble: 10 x 5
     AGE SMOKING  VAR1  VAR2  VAR3
   <dbl>   <dbl> <dbl> <dbl> <dbl>
 1    50       0     1    NA     1
 2    65       0     0     0     0
 3    66       0     0     1     1
 4    40       0     1    NA     1
 5    70       0     0     0     1
 6    25       1     1     1    NA
 7    65       1     0     0     0
 8    67       1     1     0     0
 9    44       1    NA    NA     1
10    56       1    NA     1    NA
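Writing one case_when() call per column does not scale to the thousands of variables mentioned in the question. As a rough sketch, assuming dplyr 1.0.0 or later, the same condition can be applied to every column named in VARv by placing case_when() inside across(). Here AGE is referenced directly by name, and NA_real_ keeps the columns numeric.

```r
library(dplyr)

# Example data and variable-name vector from the question
foo <- data.frame(AGE = c(50, 65, 66, 40, 70, 25, 65, 67, 44, 56),
                  SMOKING = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
                  VAR1 = c(1, 0, 0, 1, 0, 1, 0, 1, 0, 0),
                  VAR2 = c(0, 0, 1, 0, 0, 1, 0, 0, 0, 1),
                  VAR3 = c(1, 0, 1, 1, 1, 0, 0, 0, 1, 0))
VARv <- c("VAR1", "VAR2", "VAR3")

res <- foo %>%
  mutate(across(all_of(VARv),
                # .x is the current column; AGE is looked up in the same data frame
                ~ case_when(AGE < 65 & .x == 0 ~ NA_real_,
                            TRUE ~ .x)))
```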
