Separate string into different columns R

Question

I have data from a questionnaire. One of the questions is multiple choice and includes an "other" option, where the user can write something else. I receive an Excel file with one column for that question, with the selected options separated by semicolons. An example of the dataset is below:

ID  Prob_saude
1   "Não tenho nenhum dos problemas de saúde indicados;" 
2   " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);"
3   " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);"
4   "Doença autoimmune;" 
5   " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Diabetes;"  
6    "HIV;"
7    "Não tenho nenhum dos problemas de saúde indicados;" 
8    "Cardiológica;"

I want to create one column per disease indicating whether the user selected that option (coded 2 = yes, 1 = no in the desired output below). Then I want to create another column holding the text entered under "other". In this case, the options available were:

disease <- c(" Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);",
         "Hipertensão arterial (tensão arterial alta);", "Doença autoimmune;",
         "Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);",
         "Não tenho nenhum dos problemas de saúde indicados;")

The desired output would be this:

ID  Prob_saude_1 Prob_saude_2 Prob_saude_3 Prob_saude_4 Prob_saude_5 Prob_saude_6 Prob_saude
1         1           1            1            1            2          NA        "Não tenho nenhum dos problemas de saúde indicados;" 
2         2           1            1            1            1          NA        " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);"
3         2           2            1            2            1          NA        " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);"
4        1           1            2            1            1          NA        "Doença autoimmune;" 
5        2           2            1            1            1        "Diabetes;" " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Diabetes;"  
6        1           1            1            1            1          "HIV;"      "HIV;"
7        1           1            1            1            2          NA           "Não tenho nenhum dos problemas de saúde indicados;" 
8        1           1            1            1            1       "Cardiológica;" "Cardiológica;"

I'm able to create the extra columns based on the option, but when I try to create the column for other the output given is equal to the column Prob_saude, so it doesn't exclude the options already selected. Any ideas? This is what I have so far. Feel free to give any suggestions, if you think there's a better way to achieve this.

dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := Prob_saude]

for (index in 1:length(disease)) {
    rows <- grep(disease[index], dataset$Prob_saude, fixed = T)
    dataset[, paste("Prob_saude", index, sep = "_") := ifelse(rownames(dataset) %in% rows, 2, ifelse(is.na(dataset$Prob_saude), NA, 1))]
    dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := gsub(disease[index], "", dataset$Prob_saude, fixed = T)]
}

Answer 1

One way to handle this situation is to combine the items entered as "other" with the list of disease types. Given the data, there are five disease types in the original disease vector and three new ones from the questionnaires.

First, after some cleanup, we read the data posted with the question.

textFile <- "id|response
1|Não tenho nenhum dos problemas de saúde indicados; 
2| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);
3| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);
4|Doença autoimmune; 
5| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Diabetes;  
6|HIV;
7|Não tenho nenhum dos problemas de saúde indicados; 
8|Cardiológica; "

data <- read.csv(text = textFile,sep = "|",
                 header = TRUE, stringsAsFactors = FALSE)
disease <- c("Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica)",
             "Hipertensão arterial (tensão arterial alta)", 
             "Doença autoimmune",
             "Problemas renais crónicos (doença nos rins, incluindo insuficiência renal)",
             "Não tenho nenhum dos problemas de saúde indicados")

Next, we load some packages from the tidyverse, clean the questionnaire data and convert it to narrow format tidy data.

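library(tidyr)
library(dplyr)
# split each response into up to five pieces (short responses are padded with NA),
# reshape to one row per piece, then trim whitespace and drop the empty pieces
data %>% separate(.,response,into = c("resp1","resp2","resp3","resp4","resp5"),
                  sep=";",fill = "right")  %>% group_by(id) %>%
     pivot_longer(.,c(resp1,resp2,resp3,resp4,resp5),values_to = "disease") %>%
     mutate(disease = trimws(disease)) %>%
     filter(!disease %in% c(NA," ","  ",""))    -> narrowData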

At this point narrowData contains 12 observations and 3 columns.

> head(narrowData)
# A tibble: 6 x 3
# Groups:   id [4]
     id name  disease                                                             
  <int> <chr> <chr>                                                               
1     1 resp1 Não tenho nenhum dos problemas de saúde indicados                   
2     2 resp1 Doença respiratória/pulmonar (incluindo asma, bronquite crónica e d…
3     3 resp1 Doença respiratória/pulmonar (incluindo asma, bronquite crónica e d…
4     3 resp2 Hipertensão arterial (tensão arterial alta)                         
5     3 resp3 Problemas renais crónicos (doença nos rins, incluindo insuficiência…
6     4 resp1 Doença autoimmune                                                   
>
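
If the maximum number of answers per respondent is not known in advance, the same narrow format can probably be reached with tidyr's separate_rows(), which splits on the delimiter without naming resp1 to resp5 up front. The following is only a minimal sketch on a small made-up data frame; toy, toyNarrow, and their values are assumptions for illustration, not the question's data.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data in the same "answers separated by ;" layout
toy <- data.frame(id = 1:3,
                  response = c("A;", " B;C;", "D; "),
                  stringsAsFactors = FALSE)

toyNarrow <- toy %>%
     separate_rows(response, sep = ";") %>%   # one row per piece
     mutate(disease = trimws(response)) %>%   # strip stray blanks
     filter(disease != "") %>%                # drop empty pieces left by the trailing ";"
     select(id, disease)

toyNarrow
```

The trimming and filtering steps are the same as in the answer's pipeline; only the splitting step differs.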

Next, we combine data from the disease vector with the survey responses to find the unique values across surveys and the input disease list.

# unique diseases reported in the surveys, in order of first appearance
# (disease was already trimmed when narrowData was built)
diseaseList <- unique(narrowData$disease)
# expanded list
combinedDiseases <- unique(c(diseaseList,disease))
disease_id <- seq_along(combinedDiseases)
diseaseData <- data.frame(disease_id,disease = combinedDiseases,
                          stringsAsFactors = FALSE)

In the diseaseData data frame, the diseases reported in questionnaires but not in the original list are at positions 6, 7, and 8.

Since we created a unique sequential number to associate with each disease name, we can now merge the data, and use the disease id number to pivot the data back to a wide format data set by survey respondent id.

narrowData %>% left_join(.,diseaseData) -> joinedData
# create wide format data 
joinedData %>% select(id,disease_id) %>% mutate(value = 2) %>%
     pivot_wider(.,id_cols = id,names_from = disease_id,names_prefix = "disease",
                 values_from = value) -> result

Finally, we set all NA values in the output to 1, and print.

result[is.na(result)] <- 1
result

...and the output:

> result
# A tibble: 8 x 9
# Groups:   id [8]
     id disease1 disease2 disease3 disease4 disease5 disease6 disease7 disease8
  <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1     1        2        1        1        1        1        1        1        1
2     2        1        2        1        1        1        1        1        1
3     3        1        2        2        2        1        1        1        1
4     4        1        1        1        1        2        1        1        1
5     5        1        2        2        1        1        2        1        1
6     6        1        1        1        1        1        1        2        1
7     7        2        1        1        1        1        1        1        1
8     8        1        1        1        1        1        1        1        2
>
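
The NA replacement step above can also be folded into the reshape itself, since pivot_wider() accepts a values_fill argument. A minimal sketch, assuming a small made-up narrow data frame (toyJoined and its values are hypothetical):

```r
library(tidyr)

# Hypothetical narrow data: one row per respondent/disease id pair
toyJoined <- data.frame(id = c(1, 2, 2, 3),
                        disease_id = c(1, 1, 2, 3),
                        value = 2)

toyWide <- pivot_wider(toyJoined,
                       id_cols = id,
                       names_from = disease_id,
                       names_prefix = "disease",
                       values_from = value,
                       values_fill = list(value = 1))  # missing combinations become 1
toyWide
```

Here respondent 2 gets 2 in disease1 and disease2 and 1 in disease3, without a separate is.na() assignment.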

Coding additional reported diseases as "other"

Per the comments to my answer, the OP would like any diseases reported by survey respondents that are not in the original list of diseases to be coded to a single response variable. Here is code that fulfills that requirement.

library(tidyr)
library(dplyr)
# same reshaping as before: split, pivot to one row per piece, trim, drop empties
data %>% separate(.,response,into = c("resp1","resp2","resp3","resp4","resp5"),
                  sep=";",fill = "right")  %>% group_by(id) %>%
     pivot_longer(.,c(resp1,resp2,resp3,resp4,resp5),values_to = "disease") %>%
     mutate(disease = trimws(disease)) %>%
     filter(!disease %in% c(NA," ","  ",""))    -> narrowData

Once again we have a narrow format tidy data frame consisting of one row per reported disease.

Next, we process the diseases to identify reported diseases not in the original list of choices, assign them a disease id of one greater than the length of the disease vector, and create a data frame.

# diseases reported in the surveys that are not in the original list
reportedDiseases <- unique(narrowData$disease)
notInDiseaseList <- reportedDiseases[!reportedDiseases %in% disease]
# the original choices get ids 1 to 5
disease_id <- seq_along(disease)
diseaseData <- data.frame(disease_id,disease,stringsAsFactors = FALSE)
# every "other" disease shares a single id, one past the original list
disease_id <- rep(max(diseaseData$disease_id)+1,length(notInDiseaseList))
reportedDiseases <- data.frame(disease_id,disease = notInDiseaseList,stringsAsFactors = FALSE)
diseaseData <- rbind(diseaseData,reportedDiseases)

Notice that the reported diseases not in the original list all have the same value for disease_id.

Next, we join the diseaseData data frame with the narrow format file so we can pivot_wider() by disease id.

narrowData %>% left_join(.,diseaseData) -> joinedData

Finally, we eliminate duplicates where disease_id is equal to 6 before using pivot_wider() to create a data frame with six columns coded 1 = no disease, 2 = disease, for the 5 types plus "other".

# create wide format data after eliminating duplicate rows
# where a respondent reported more than one "other" disease
joinedData %>% select(id,disease_id) %>% 
     group_by(id,disease_id) %>%
     mutate(value = 2, n = row_number()) %>%
     filter(n == 1) %>% 
     pivot_wider(.,id_cols = id,names_from = disease_id,names_prefix = "disease",
                 values_from = value) -> result
result[is.na(result)] <- 1
result

...and the output:

> result
# A tibble: 8 x 7
# Groups:   id [8]
     id disease5 disease1 disease2 disease4 disease3 disease6
  <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1     1        2        1        1        1        1        1
2     2        1        2        1        1        1        1
3     3        1        2        2        2        1        1
4     4        1        1        1        1        2        1
5     5        1        2        2        1        1        2
6     6        1        1        1        1        1        2
7     7        2        1        1        1        1        1
8     8        1        1        1        1        1        2
>
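
The desired output in the question keeps the free-text answer itself in the last column rather than a 1/2 code. A minimal sketch of that variant, assuming a narrow data frame like narrowData; the names known, toyNarrow, otherText, and toyResult and all values below are made up for illustration:

```r
library(dplyr)

# Hypothetical predefined options and narrow data (one row per reported item)
known <- c("A", "B")
toyNarrow <- data.frame(id = c(1, 2, 2, 3, 3),
                        disease = c("A", "B", "X", "Y", "Z"),
                        stringsAsFactors = FALSE)

otherText <- toyNarrow %>%
     filter(!disease %in% known) %>%                     # keep only free-text answers
     group_by(id) %>%
     summarise(other = paste(disease, collapse = ";"))   # one string per respondent

# Respondents without free text get NA after the join
toyResult <- data.frame(id = c(1, 2, 3)) %>%
     left_join(otherText, by = "id")
toyResult
```

The resulting other column could then be joined to the wide result by id in place of the numeric "other" indicator.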

Answer 2

gsub wasn't working because of the parentheses. Changing the string solves the problem.

The code now is a bit longer.

dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := Prob_saude]
for (index in 1:length(disease)) {
    rows <- grep(disease[index], dataset$Prob_saude, fixed = T)
    dataset[, paste("Prob_saude", index, sep = "_") := ifelse(rownames(dataset) %in% rows, 2, ifelse(is.na(dataset$Prob_saude), NA, 1))]
}

disease <- c(" Doença respiratória/pulmonar \\(incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica\\);|Hipertensão arterial \\(tensão arterial alta\\);|Doença autoimmune;|Problemas renais crónicos \\(doença nos rins, incluindo insuficiência renal\\);|Não tenho nenhum dos problemas de saúde indicados;")
dataset$other_disease <- gsub(disease, "", dataset$Prob_saude)
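
As an alternative sketch (not the code above), the escaping can be avoided: with fixed = TRUE the parentheses are matched literally, provided each pass works on the result of the previous pass rather than on the original column. The example below uses plain character vectors instead of the data.table dataset, with two rows copied from the question's sample; the names other and option are assumptions.

```r
# The question's option strings, each ending in ";" as in the raw data
disease <- c(" Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);",
             "Hipertensão arterial (tensão arterial alta);",
             "Doença autoimmune;",
             "Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);",
             "Não tenho nenhum dos problemas de saúde indicados;")

# Two rows from the question's sample data
Prob_saude <- c(" Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Diabetes;",
                "Doença autoimmune;")

other <- Prob_saude
for (option in disease) {
    # fixed = TRUE treats "(" and ")" literally; each pass works on the
    # already-reduced string, so earlier removals are kept
    other <- gsub(option, "", other, fixed = TRUE)
}
other <- trimws(other)
other[other == ""] <- NA   # nothing left means no free-text answer
other
```

For these two rows, the first keeps only its free-text entry and the second becomes NA because every selected option was in the predefined list.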


2 answers

solution1
1 2020-04-25 21:05:18

Coding additional reported diseases as "other"

solution2
0 2020-04-26 09:47:30
