
What is the shortest and cleanest way to recode multiple variables in a dataframe using R?

I work in social science, and I often have to change the values of several variables at once. More often than not this means reversing a scale. I have worked with SPSS for a long time, and the syntax there is quite simple. To change the values of multiple variables you write:

RECODE var1 var2 var3 (1=5) (2=4) (4=2) (5=1) (ELSE=COPY).

To write the new codes into new variables, you add INTO newvar1 newvar2 newvar3. at the end. Inside the parentheses you can also use keywords and ranges such as hi, lo, 1 thru 4 and so on.

Now I'm working my way into R and I'm struggling to find the best way to reproduce this workflow. I found the following solutions, but none of them is as short and tidy as I would like:

## Packages -----
library(dplyr)
library(car)

## Data -----
tib <- tibble(v1 = 1:4, 
              v2 = 1:4,
              v3 = sample(1:5, 4, replace = FALSE))

vars <- c("v1", "v2", "v3")

The base way:

tib$v2_rec <- NA
tib$v2_rec[tib$v2 == 1] <- 5 #1
tib$v2_rec[tib$v2 == 2] <- 4 #2
tib$v2_rec[tib$v2 == 3] <- 3 #3
tib$v2_rec[tib$v2 == 4] <- 2 #4
tib$v2_rec[tib$v2 == 5] <- 1 #5
# I'm forced to create a new variable here; otherwise #4 and #5 would overwrite the results of #1 and #2.
# Therefore I won't even bother trying to loop through multiple variables.

recode() from the package car:

tib$v1 <- recode(tib$v1, "1=5; 2=4; 4=2; 5=1")
# This is nice, understandable and short
# To handle multiple variables, the following attempts don't work, because the recode function doesn't seem to be able to iterate over lists:

tib[vars] <- recode(tib[vars], "1=5; 2=4; 4=2; 5=1")
tib[1:3] <- recode(tib[1:3], "1=5; 2=4; 4=2; 5=1")

# I'd be forced to loop:

for (i in vars) {
  tib[[i]] <- recode(tib[[i]], "1=5; 2=4; 4=2; 5=1")
}

I'm pretty happy with that, but I was wondering whether there is a function that would do the looping for me. I'm really struggling with the dplyr functions at the moment, and I'm not happy that I can't figure them out intuitively.

I tried mutate:

# I get it to work for a single variable, and for multiple variables I arrived at a solution in combination with the recode() function:

tib <- tib %>%
  mutate_at(vars(v1:v3), 
            function(x) recode(x, "1=5; 2=4; 4=2; 5=1"))

Is this the best way to do this? Just to be clear, I saw some other solutions using case_when(), replace() or mapvalues(), but I find the solution above better because I like to see at a glance which value gets recoded to which.

I looked a little into the apply() function and could not even recode one variable with it. I'm sure I'll get a grip on that soon as well, but at the moment I'm a little frustrated by how long I spend searching for workflows that took me one line in SPSS. If you know a shorter and cleaner solution than the one above using the apply() function, I would be grateful!

I'm happy with R and its possibilities, but right now I need a hint in the right direction to keep me going. Thank you in advance!

I think if used correctly, dplyr has the "cleanest" syntax in this case:

library(dplyr)
tib <- tibble(v1 = 1:4, 
              v2 = 1:4,
              v3 = sample(1:5, 4, replace = FALSE))

tib %>% 
  mutate_at(vars(v1:v3), recode, `1` = 5, `2` = 4, `3` = 3, `4` = 2, `5` = 1)
#> # A tibble: 4 x 3
#>      v1    v2    v3
#>   <dbl> <dbl> <dbl>
#> 1     5     5     2
#> 2     4     4     5
#> 3     3     3     4
#> 4     2     2     1

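Note that I had to add `3` = 3. The columns are integers and the replacements are doubles, and when the types differ recode() turns unmatched values into NA (with a warning) unless every value has a replacement or .default is supplied.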
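A minimal sketch of why the extra `3` = 3 was needed, assuming the documented behaviour of dplyr::recode(): unmatched values are kept only when the replacements have the same type as the input. The columns here are integer (they were created with 1:4), so integer replacements (5L rather than 5) should let you leave out the unchanged value. The vector x is made up for illustration.

```r
library(dplyr)

# Hypothetical integer item, like the columns created with 1:4 above
x <- 1:5

# Integer replacements match the type of x, so the unmatched value 3 is kept
x_rev <- dplyr::recode(x, `1` = 5L, `2` = 4L, `4` = 2L, `5` = 1L)
x_rev
```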

I often find it easier to write things more explicitly with functions that are new to me, so maybe this might help:

tib %>% 
  mutate_at(.vars = vars(v1:v3), 
            .funs = function(x) recode(x, 
                                       `1` = 5, 
                                       `2` = 4, 
                                       `3` = 3, 
                                       `4` = 2, 
                                       `5` = 1))

If you prefer the recode function from car you should not load car but use:

tib %>% 
  mutate_at(vars(v1:v3), car::recode, "1=5; 2=4; 4=2; 5=1")

That way you don't run into trouble mixing dplyr with car (as long as you don't need car for anything else).
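In dplyr 1.0.0 and later, mutate_at() is superseded by across(), which can also write the recoded values to new columns, much like INTO in SPSS. The following is a sketch under that assumption; the fixed v3 values and the _rec suffix are chosen only for illustration.

```r
library(dplyr)

# Small deterministic example (v3 is fixed here instead of sampled)
tib <- tibble(v1 = 1:4,
              v2 = 1:4,
              v3 = c(2L, 5L, 4L, 1L))

# Reverse v1 to v3 and store the results in v1_rec, v2_rec, v3_rec;
# dplyr::recode is called with its namespace so car cannot mask it
tib_rec <- tib %>%
  mutate(across(v1:v3,
                ~ dplyr::recode(.x, `1` = 5L, `2` = 4L, `4` = 2L, `5` = 1L),
                .names = "{.col}_rec"))

tib_rec
```

Dropping the .names argument overwrites the original columns instead of adding new ones.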

Here is a simple way using only base functions. This assumes that these are 5-point Likert items where the original coding was 1 to 5. If you had, say, 7-point Likert items, or items coded 0 to 4 or -2 to 2, you'd need to adapt this.

Some coding notes: Your dataset has a pseudorandom element (the call to sample()); to make it exactly reproducible, call set.seed() first (see ?set.seed). You can automatically print a variable or dataset as it is assigned by enclosing the assignment in parentheses, as in (var <- value). R is vectorized, so you don't need a loop (although a loop is really OK here; with so few variables it won't cause a noticeable slowdown).

set.seed(4636)  # this makes the example exactly reproducible
(d <- data.frame(v1 = 1:4, 
                 v2 = 1:4,
                 v3 = sample(1:5, 4, replace = FALSE)))  # adding outer ()'s prints
#   v1 v2 v3
# 1  1  1  1
# 2  2  2  2
# 3  3  3  5
# 4  4  4  4

d.orig <- d  # keep a copy of the original dataset so it isn't overwritten
(d <- 6-d)  # adding outer ()'s prints
#   v1 v2 v3
# 1  5  5  5
# 2  4  4  4
# 3  3  3  1
# 4  2  2  2

rec.vars <- c("v2")
d.some   <- d.orig
(d.some[,rec.vars] <- 6-d.some[,rec.vars])
# [1] 5 4 3 2
d.some
#   v1 v2 v3
# 1  1  5  1
# 2  2  4  2
# 3  3  3  5
# 4  4  2  4

##### to do more than 1 variable
(rec.vars <- paste0("v", c(2,3)))
# [1] "v2" "v3"
d.some   <- d.orig
(d.some[,rec.vars] <- 6-d.some[,rec.vars])
#   v2 v3
# 1  5  5
# 2  4  4
# 3  3  1
# 4  2  2
d.some
#   v1 v2 v3
# 1  1  5  5
# 2  2  4  4
# 3  3  3  1
# 4  4  2  2
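The same idea extends to the other codings mentioned above: subtracting each value from (lowest code + highest code) reverses the scale. A minimal base R sketch, where reverse_scale(), the items data frame and the _rec suffix are hypothetical names introduced for illustration:

```r
# Reverse a scale coded from `low` to `high` (hypothetical helper)
reverse_scale <- function(x, low, high) {
  (low + high) - x
}

items <- data.frame(v1 = 1:4,
                    v2 = 1:4,
                    v3 = c(1L, 2L, 5L, 4L))

rec.vars <- c("v2", "v3")

# Write the reversed items to new columns so the originals are kept
items[paste0(rec.vars, "_rec")] <- lapply(items[rec.vars], reverse_scale,
                                          low = 1, high = 5)
items

# Other codings only need different end points
rev_0_4 <- reverse_scale(0:4, low = 0, high = 4)     # items coded 0 to 4
rev_m2_2 <- reverse_scale(-2:2, low = -2, high = 2)  # items coded -2 to 2
```

Assigning to items[rec.vars] instead of the suffixed names would overwrite the original columns in place.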
