
Tidying in R: how to collapse my binary columns into characters, based on vectors?

I am tidying my data in R and want to collapse multiple columns into one, using a function that iterates over the items of a vector. Could you help me to:

  • fix a semantic error,
  • and make my code more efficient?

My data is based on a survey with 32 questions. Each question has multiple answers. Each answer is a column, with options 1 and NA.

For one question, a section of the dataset can be reproduced as follows:

XV2_1 <- c(1,NA,NA,NA)
XV2_2 <- c(NA,1,NA,NA)
XV2_3 <- c(NA,NA,NA,1)
XV2_4 <- c(NA,NA,1,NA)
id <- c(12,13,14,15)

dat <- data.frame(id,XV2_1, XV2_2, XV2_3,XV2_4)

> dat
  id XV2_1 XV2_2 XV2_3 XV2_4
1 12     1    NA    NA    NA
2 13    NA     1    NA    NA
3 14    NA    NA    NA     1
4 15    NA    NA     1    NA

This is the data I would like to have:

question_2_answers <- c("Yellow","Blue","Green","Orange") # answer labels taken from the questionnaire

collapsed <- c("Yellow","Blue","Green","Orange")

collapsed_dataframe <- data.frame(id, X2 = collapsed)
> collapsed_dataframe
  id     X2
1 12 Yellow
2 13   Blue
3 14  Green
4 15 Orange

So far, I tried a sequence of "ifelse's" combined with mutate:

library(tidyverse)
question_2_answers <- c("Yellow","Blue","Green","Orange") # answer labels taken from the questionnaire

# Defined before use so the snippet runs top to bottom
tidy_Q2 <- function(a,b,c,d,e) {
  ifelse(b == 1, a[1],ifelse(
    c==1,a[2],ifelse(
      d==1,a[3],a[4])))
}

dat %>%
  mutate(
    Colour = tidy_Q2(question_2_answers,XV2_1,XV2_2,XV2_3,XV2_4)
  )

However, my output is not as expected:

  id XV2_1 XV2_2 XV2_3 XV2_4 Colour
1 12     1    NA    NA    NA Yellow
2 13    NA     1    NA    NA   <NA>
3 14    NA    NA    NA     1   <NA>
4 15    NA    NA     1    NA   <NA>

I would have liked it to be as follows:

  id XV2_1 XV2_2 XV2_3 XV2_4 Colour
1 12     1    NA    NA    NA Yellow
2 13    NA     1    NA    NA   Blue
3 14    NA    NA    NA     1  Green
4 15    NA    NA     1    NA Orange

Does anyone know how to fix this? I would also like to know whether my code can be made more efficient: I have 32 survey questions to process after this one, and I'd like to automate the process as much as possible. Things to keep in mind:

  • not all survey questions have the same number of options (e.g. question 2 has 2 options and therefore 2 columns, whilst question 10 has 8 options and 8 columns)
  • some values are strings, instead of 1 or NA

Always happy to learn,

Best,

Maria

This is a kind of wide-to-long conversion, which we can do with tidyr::gather:

First, we make the colors the names of the corresponding columns:

# Replace column names (except for the `id` column) with color values
colnames(dat)[-1] <- c("Yellow","Blue","Orange","Green")

dat
  id Yellow Blue Orange Green
1 12      1   NA     NA    NA
2 13     NA    1     NA    NA
3 14     NA   NA     NA     1
4 15     NA   NA      1    NA
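
Renaming by position relies on the columns arriving in the expected order. As a minimal sketch of an alternative, the columns could be renamed by name instead. The lookup vector `q2_labels` and the object names `dat_raw` and `renamed` are assumptions made for this illustration, not part of the original post:

```r
# Data as given in the question
dat_raw <- data.frame(
  id    = c(12, 13, 14, 15),
  XV2_1 = c(1, NA, NA, NA),
  XV2_2 = c(NA, 1, NA, NA),
  XV2_3 = c(NA, NA, NA, 1),
  XV2_4 = c(NA, NA, 1, NA)
)

# Hypothetical lookup: original column name -> answer label
q2_labels <- c(XV2_1 = "Yellow", XV2_2 = "Blue",
               XV2_3 = "Orange", XV2_4 = "Green")

# Rename only the columns that appear in the lookup; `id` is left untouched
renamed <- dat_raw
hit <- names(renamed) %in% names(q2_labels)
names(renamed)[hit] <- unname(q2_labels[names(renamed)[hit]])

names(renamed)
```

Because the match is done on names, the result does not depend on the column order in the data frame.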

Then, we gather the non-id columns and drop the NA values:

library(tidyverse)
dat %>%
    gather(X2, val, -id) %>%   # Gather color cols from wide to long format
    filter(!is.na(val)) %>%    # Drop rows with NA values
    select(-val)               # Remove the unnecessary `val` column

  id     X2
1 12 Yellow
2 13   Blue
3 15 Orange
4 14  Green

This will work with any number of columns (you just need to specify all columns you don't want to gather) and keeps only the entries with non-NA values. If you want other conditions to exclude an entry (for example, if 0 or 'unknown' should count as a non-answer, or only 'correct' counts as an answer), add those conditions to the filter statement.
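
In current tidyr releases, gather is superseded by pivot_longer. The same idea could be sketched as follows, without renaming the columns first. The lookup vector `q2_labels` and the names `dat_raw` and `result` are assumptions made for this illustration:

```r
library(dplyr)
library(tidyr)

# Data as given in the question
dat_raw <- data.frame(
  id    = c(12, 13, 14, 15),
  XV2_1 = c(1, NA, NA, NA),
  XV2_2 = c(NA, 1, NA, NA),
  XV2_3 = c(NA, NA, NA, 1),
  XV2_4 = c(NA, NA, 1, NA)
)

# Hypothetical lookup: original column name -> answer label
q2_labels <- c(XV2_1 = "Yellow", XV2_2 = "Blue",
               XV2_3 = "Orange", XV2_4 = "Green")

result <- dat_raw %>%
  pivot_longer(-id, names_to = "column", values_to = "val",
               values_drop_na = TRUE) %>%      # reshape and drop NA cells in one step
  mutate(X2 = unname(q2_labels[column])) %>%   # translate column name to answer label
  select(id, X2)

result
```

Unlike gather, pivot_longer keeps the rows grouped by id, so the result comes out in the original id order. If some answer columns hold strings and others numbers, the values may need to be converted to a common type first, for example with `values_transform = list(val = as.character)`.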

One option in base R is max.col: find the column index of the first non-NA value in each row, use that index to look up the corresponding column name, and build a two-column data.frame by cbind-ing the result with the id column.

i1 <- max.col(!is.na(dat[-1]), 'first')
cbind(dat['id'], Colour = names(dat)[-1][i1])
#  id Colour
#1 12 Yellow
#2 13   Blue
#3 14  Green
#4 15 Orange

data

dat <-  structure(list(id = c(12, 13, 14, 15), Yellow = c(1, NA, NA, 
NA), Blue = c(NA, 1, NA, NA), Orange = c(NA, NA, NA, 1), Green = c(NA, 
NA, 1, NA)), class = "data.frame", row.names = c(NA, -4L))
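
To cover many questions at once, the max.col idea could be wrapped in a small helper and applied once per question. The following is only a sketch under stated assumptions: the helper `collapse_question`, the list `answer_labels`, and the second question `XV3_*` with its labels are made up for illustration, and every question's columns are assumed to share a prefix such as "XV2_". Note that max.col returns 1 for a row without any answer, so such rows are set to NA explicitly.

```r
# Hypothetical helper: collapse the indicator columns that start with `prefix`
# into one character vector, taking `labels` in column order.
collapse_question <- function(df, prefix, labels) {
  cols <- df[startsWith(names(df), prefix)]
  stopifnot(length(labels) == ncol(cols))
  answered <- !is.na(cols)                  # logical matrix: TRUE where a value is present
  idx <- max.col(answered, ties.method = "first")
  idx[rowSums(answered) == 0] <- NA         # max.col gives 1 for all-FALSE rows; blank those
  labels[idx]
}

survey <- data.frame(
  id    = c(12, 13, 14, 15),
  XV2_1 = c(1, NA, NA, NA),
  XV2_2 = c(NA, 1, NA, NA),
  XV2_3 = c(NA, NA, NA, 1),
  XV2_4 = c(NA, NA, 1, NA),
  XV3_1 = c(NA, 1, NA, NA),                 # made-up second question
  XV3_2 = c("1", NA, NA, "yes")             # string values count as answers too
)

# One label vector per question prefix (XV3_ labels are made up)
answer_labels <- list(
  XV2_ = c("Yellow", "Blue", "Orange", "Green"),
  XV3_ = c("No", "Yes")
)

collapsed <- data.frame(id = survey$id)
for (prefix in names(answer_labels)) {
  new_name <- sub("_$", "", prefix)         # "XV2_" -> "XV2"
  collapsed[[new_name]] <- collapse_question(survey, prefix, answer_labels[[prefix]])
}

collapsed
```

Since the helper only tests for non-NA values, it works for questions with any number of options and for columns that contain strings instead of 1.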
