I have data from a questionnaire. One of the questions is multiple choice and includes the option "other", where the user can write something else. I receive an Excel file with one column for that specific question, with the selected options separated by semicolons. Example of the dataset below:
ID Prob_saude
1 "Não tenho nenhum dos problemas de saúde indicados;"
2 " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);"
3 " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);"
4 "Doença autoimmune;"
5 " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Diabetes;"
6 "HIV;"
7 "Não tenho nenhum dos problemas de saúde indicados;"
8 "Cardiológica;"
I want to create a column for each disease with yes/no indicating whether the user selected that option. Then I want to create another column holding the "other" answers. In this case, the options available were:
disease <- c(" Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);",
"Hipertensão arterial (tensão arterial alta);", "Doença autoimmune;",
"Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);",
"Não tenho nenhum dos problemas de saúde indicados;")
The desired output would be this:
ID Prob_saude_1 Prob_saude_2 Prob_saude_3 Prob_saude_4 Prob_saude_5 Prob_saude_6 Prob_saude
1 1 1 1 1 2 NA "Não tenho nenhum dos problemas de saúde indicados;"
2 2 1 1 1 1 NA " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);"
3 2 2 1 2 1 NA " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);"
4 1 1 2 1 1 NA "Doença autoimmune;"
5 2 2 1 1 1 "Diabetes;" " Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Diabetes;"
6 1 1 1 1 1 "HIV;" "HIV;"
7 1 1 1 1 2 NA "Não tenho nenhum dos problemas de saúde indicados;"
8 1 1 1 1 1 "Cardiológica;" "Cardiológica;"
I'm able to create the extra columns based on the option, but when I try to create the column for other the output given is equal to the column Prob_saude, so it doesn't exclude the options already selected. Any ideas? This is what I have so far. Feel free to give any suggestions, if you think there's a better way to achieve this.
dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := Prob_saude]
for (index in 1:length(disease)) {
rows <- grep(disease[index], dataset$Prob_saude, fixed = T)
dataset[, paste("Prob_saude", index, sep = "_") := ifelse(rownames(dataset) %in% rows, 2, ifelse(is.na(dataset$Prob_saude), NA, 1))]
dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := gsub(disease[index], "", dataset$Prob_saude, fixed = T)]
}
One way to handle this situation would be to combine the items listed as "other" in the list of disease types. Given the data, there are 5 disease types in the original disease vector, and three new ones from the questionnaires.
First, after some clean up we read the data posted with the question.
textFile <- "id|response
1|Não tenho nenhum dos problemas de saúde indicados;
2| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);
3| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Problemas renais crónicos (doença nos rins, incluindo insuficiência renal);
4|Doença autoimmune;
5| Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica);Hipertensão arterial (tensão arterial alta);Diabetes;
6|HIV;
7|Não tenho nenhum dos problemas de saúde indicados;
8|Cardiológica; "
data <- read.csv(text = textFile,sep = "|",
header = TRUE, stringsAsFactors = FALSE)
disease <- c("Doença respiratória/pulmonar (incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica)",
"Hipertensão arterial (tensão arterial alta)",
"Doença autoimmune",
"Problemas renais crónicos (doença nos rins, incluindo insuficiência renal)",
"Não tenho nenhum dos problemas de saúde indicados")
Next, we load some packages from the tidyverse, clean the questionnaire data and convert it to narrow format tidy data.
library(tidyr)
library(dplyr)
library(glue)
data %>% separate(.,response,into = c("resp1","resp2","resp3","resp4","resp5"),
sep=";") %>% group_by(id) %>%
pivot_longer(.,c(resp1,resp2,resp3,resp4,resp5),values_to = "disease") %>%
mutate(disease = trimws(disease)) %>%
filter(!disease %in% c(NA," "," ","")) -> narrowData
At this point narrowData contains 12 observations and 3 columns.
> head(narrowData)
# A tibble: 6 x 3
# Groups: id [4]
id name disease
<int> <chr> <chr>
1 1 resp1 Não tenho nenhum dos problemas de saúde indicados
2 2 resp1 Doença respiratória/pulmonar (incluindo asma, bronquite crónica e d…
3 3 resp1 Doença respiratória/pulmonar (incluindo asma, bronquite crónica e d…
4 3 resp2 Hipertensão arterial (tensão arterial alta)
5 3 resp3 Problemas renais crónicos (doença nos rins, incluindo insuficiência…
6 4 resp1 Doença autoimmune
>
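The separate() call above hard-codes five response columns, so a respondent who ticked more than five options would lose answers. As a hedged alternative (not part of the original answer), tidyr::separate_rows() splits a delimited column straight into one row per piece. The toy data frame below is a hypothetical stand-in for the questionnaire data.

```r
library(tidyr)
library(dplyr)

# Hypothetical miniature of the questionnaire data (same shape as `data`)
toy <- data.frame(
  id = 1:3,
  response = c("A;", " B;C;", "D;"),
  stringsAsFactors = FALSE
)

narrowToy <- toy %>%
  separate_rows(response, sep = ";") %>%   # one row per ';'-separated piece
  mutate(disease = trimws(response)) %>%   # drop leading/trailing spaces
  filter(disease != "") %>%                # discard empty pieces, if any
  select(id, disease)

narrowToy
```

Because no column names are fixed in advance, the same pipeline should work regardless of how many options a respondent selected.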
Next, we combine data from the disease vector with the survey responses to find the unique values across surveys and the input disease list.
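# unique reported diseases (already trimmed above); read from the column
# directly so the result does not depend on which column distinct()
# returns first for grouped data
diseaseList <- unique(narrowData$disease)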
# expanded list
combinedDiseases <- unique(c(diseaseList,disease))
disease_id <- 1:length(combinedDiseases)
diseaseData <- data.frame(disease_id,disease = combinedDiseases,
stringsAsFactors = FALSE)
In the diseaseData data frame, the diseases reported in questionnaires but not in the original list are at positions 6, 7, and 8.
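Where the extra diseases land in the lookup table depends on the order in which the names are combined. The sketch below uses hypothetical stand-in vectors (original, reported) to show one way of checking which ids belong to write-in answers, using base R's setdiff().

```r
# Hypothetical stand-ins for the `disease` vector and the reported diseases
original <- c("Asthma", "Hypertension", "None")
reported <- c("None", "Asthma", "Diabetes", "HIV", "Asthma")

# Same construction as above: reported names first, then the original list
combined <- unique(c(reported, original))
lookup <- data.frame(disease_id = seq_along(combined),
                     disease = combined,
                     stringsAsFactors = FALSE)

# Names that appear in the responses but not among the fixed options
extras <- setdiff(combined, original)
lookup[lookup$disease %in% extras, ]
```

In this toy case the write-in answers get ids 3 and 4 rather than the last ids, because they appear in the responses before "Hypertension" does. If the fixed options should always keep ids 1 to n, combining as unique(c(original, reported)) is one way to guarantee that.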
Since we created a unique sequential number to associate with each disease name, we can now merge the data, and use the disease id number to pivot the data back to a wide format data set by survey respondent id.
narrowData %>% left_join(.,diseaseData) -> joinedData
# create wide format data
joinedData %>% select(id,disease_id) %>% mutate(value = 2) %>%
pivot_wider(.,id_cols = id,names_from = disease_id,names_prefix = "disease",
values_from = value) -> result
Finally, we set all NA values in the output to 1, and print.
result[is.na(result)] <- 1
result
...and the output:
> result
# A tibble: 8 x 9
# Groups: id [8]
id disease1 disease2 disease3 disease4 disease5 disease6 disease7 disease8
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 1 1 1 1 1 1
2 2 1 2 1 1 1 1 1 1
3 3 1 2 2 2 1 1 1 1
4 4 1 1 1 1 2 1 1 1
5 5 1 2 2 1 1 2 1 1
6 6 1 1 1 1 1 1 2 1
7 7 2 1 1 1 1 1 1 1
8 8 1 1 1 1 1 1 1 2
>
Per the comments to my answer, the OP would like any diseases reported by survey respondents that are not in the original list of diseases to be coded to a single response variable. Here is code that fulfills that requirement.
library(tidyr)
library(dplyr)
library(glue)
data %>% separate(.,response,into = c("resp1","resp2","resp3","resp4","resp5"),
sep=";") %>% group_by(id) %>%
pivot_longer(.,c(resp1,resp2,resp3,resp4,resp5),values_to = "disease") %>%
mutate(disease = trimws(disease)) %>%
filter(!disease %in% c(NA," "," ","")) -> narrowData
Once again we have a narrow format tidy data frame consisting of one row per reported disease.
Next, we process the diseases to identify reported diseases not in the original list of choices, assign them a disease id of one greater than the length of the disease vector, and create a data frame.
# create disease data frame by combining data with unique values in survey data frame
# read the column directly so the result does not depend on which column
# distinct() returns first for grouped data
reportedDiseases <- unique(narrowData$disease)
notInDiseaseList <- unique(reportedDiseases[!reportedDiseases %in% disease ])
disease_id <- 1:length(disease)
diseaseData <- data.frame(disease_id,disease,stringsAsFactors = FALSE)
disease_id <- rep(max(diseaseData$disease_id)+1,length(notInDiseaseList))
reportedDiseases <- data.frame(disease_id,disease = notInDiseaseList,stringsAsFactors = FALSE)
diseaseData <- rbind(diseaseData,reportedDiseases)
Notice that the reported diseases not in the original list all have the same value for disease_id.
Next, we join the diseaseData data frame with the narrow format file so we can pivot_wider() by disease id.
narrowData %>% left_join(.,diseaseData) -> joinedData
Finally, we eliminate duplicates where disease_id is equal to 6 before using pivot_wider() to create a data frame with six columns coded 1 = no disease, 2 = disease: the 5 listed types plus "other".
# create wide format data after eliminating
# any duplicates where multiple reported diseases for a respondent
joinedData %>% select(id,disease_id) %>%
group_by(id,disease_id) %>%
mutate(value = 2, n = row_number()) %>%
filter(n == 1) %>%
pivot_wider(.,id_cols = id,names_from = disease_id,names_prefix = "disease",
values_from = value) -> result
result[is.na(result)] <- 1
result
...and the output:
> result
# A tibble: 8 x 7
# Groups: id [8]
id disease5 disease1 disease2 disease4 disease3 disease6
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 1 1 1 1
2 2 1 2 1 1 1 1
3 3 1 2 2 2 1 1
4 4 1 1 1 1 2 1
5 5 1 2 2 1 1 2
6 6 1 1 1 1 1 2
7 7 2 1 1 1 1 1
8 8 1 1 1 1 1 2
>
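The result above flags "other" as 1 or 2, whereas the desired output in the question keeps the write-in text itself. A minimal base R sketch of that variant follows, assuming the fixed options are stored without the trailing semicolon; the names responses, options and other_text are hypothetical and not part of the original answer.

```r
# Hypothetical miniature of the questionnaire column and the fixed options
responses <- c("None;", " Asthma;Hypertension;Diabetes;", "HIV;", NA)
options   <- c("Asthma", "Hypertension", "None")

other_text <- vapply(strsplit(responses, ";", fixed = TRUE), function(parts) {
  parts <- trimws(parts)                        # remove stray spaces
  parts <- parts[!is.na(parts) & parts != ""]   # drop missing/empty pieces
  extra <- setdiff(parts, options)              # keep only write-in answers
  if (length(extra) == 0) {
    NA_character_                               # nothing outside the fixed list
  } else {
    paste0(paste(extra, collapse = ";"), ";")   # same ';' style as the input
  }
}, character(1))

other_text
```

Comparing whole split pieces against the option list avoids accidentally deleting part of a free-text answer that merely contains an option name.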
gsub wasn't working because of the parentheses. Changing the string to escape them solved the problem.
The code now is a bit longer.
dataset[, paste("Prob_saude", length(disease)+1, sep = "_") := Prob_saude]
for (index in 1:length(disease)) {
rows <- grep(disease[index], dataset$Prob_saude, fixed = T)
dataset[, paste("Prob_saude", index, sep = "_") := ifelse(rownames(dataset) %in% rows, 2, ifelse(is.na(dataset$Prob_saude), NA, 1))]
}
disease <- c(" Doença respiratória/pulmonar \\(incluindo asma, bronquite crónica e doença pulmonar obstrutiva crónica\\);|Hipertensão arterial \\(tensão arterial alta\\);|Doença autoimmune;|Problemas renais crónicos \\(doença nos rins, incluindo insuficiência renal\\);|Não tenho nenhum dos problemas de saúde indicados;")
dataset$other_disease <- gsub(disease, "", dataset$Prob_saude)
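As a possible alternative to the escaped regular expression, the options can be removed one at a time with fixed = TRUE, which treats parentheses literally. The sketch below uses hypothetical miniature data (answers, opts, other); the key point is that each gsub() call works on the running result rather than on the original column, so every removal is kept.

```r
# Hypothetical miniature data; `opts` plays the role of the fixed option list
answers <- c("None;", "Asthma (chronic);Diabetes;", "HIV;")
opts    <- c("Asthma (chronic);", "None;")

other <- answers
for (opt in opts) {
  # fixed = TRUE: match the option as a literal string, no escaping needed
  other <- gsub(opt, "", other, fixed = TRUE)
}
other[other == ""] <- NA   # nothing left means no write-in answer

other
```

This relies on plain substring matching, so a write-in answer that happens to contain one of the option strings would be partially stripped.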