Let's say I have multiple columns in a data frame that measure the same concept using different methods (e.g. there are multiple kinds of IQ tests, and a student could have taken any one of them, or none at all). I want to combine the various methods into a single column (an obvious use case for tidyr).
If the data is something like this:
mydata <- data.frame(ID = 55:64,
age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
Test3 = c( NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))
I would naturally want to perform something like this (note that I use na.rm = TRUE in order to not have the many many NA's in my data set get their own rows):
library(tidyr)
tests <- gather(mydata, key=IQSource, value=IQValue, c(Test1, Test2, Test3), na.rm = TRUE)
tests
Giving me:
   ID age IQSource IQValue
1  55  12    Test1     100
2  56  12    Test1      90
3  57  14    Test1      88
4  58  11    Test1     115
15 59  20    Test2     100
16 60  10    Test2     120
27 61  13    Test3     110
29 63  18    Test3      85
30 64  17    Test3     150
The problem is that I have a student (ID=62) that doesn't have any IQ scores in any of the three, and I don't want to lose her other data (the data in the ID and age columns).
Is there a way to tell tidyr that yes, I want to remove NAs where I have data in at least one of the columns I'm gathering, but at the same time keep the rows where all of the gathered columns are NA?
I didn't find a direct solution, but you could right_join the original data.frame back on and then deselect the columns you don't need.
library(tidyr)
library(dplyr)
mydata %>%
gather(key, val, Test1:Test3, na.rm = TRUE) %>%
right_join(mydata) %>%
select(-contains("Test"))
#> Joining, by = c("ID", "age")
#> ID age key val
#> 1 55 12 Test1 100
#> 2 56 12 Test1 90
#> 3 57 14 Test1 88
#> 4 58 11 Test1 115
#> 5 59 20 Test2 100
#> 6 60 10 Test2 120
#> 7 61 13 Test3 110
#> 8 62 15 <NA> NA
#> 9 63 18 Test3 85
#> 10 64 17 Test3 150
Alternatively, you could of course first create a data.frame with all the variables you want to keep and then join it:
id_data <- select(mydata, ID, age)
mydata %>%
gather(key, val, Test1:Test3, na.rm = TRUE) %>%
right_join(id_data)
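In newer versions of tidyr, gather() is superseded by pivot_longer(), so the same idea could look roughly like the sketch below. It assumes tidyr >= 1.0.0; the names long and result are mine, and I start from the ID/age columns with a left_join instead of a right_join.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
mydata <- data.frame(ID = 55:64,
                     age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
                     Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
                     Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
                     Test3 = c(NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

# Long format, dropping every NA score (ID 62 disappears at this step)
long <- mydata %>%
  pivot_longer(Test1:Test3,
               names_to = "IQSource",
               values_to = "IQValue",
               values_drop_na = TRUE)

# Start from the ID/age columns so that every student keeps a row;
# students without any score get NA in IQSource and IQValue
result <- mydata %>%
  select(ID, age) %>%
  left_join(long, by = c("ID", "age"))
```

Spelling out by = c("ID", "age") also avoids the "Joining, by" message.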
I think this will do the trick for you:
# make another data frame which has just ID and whether or not they missed all 3 tests
missing = mydata %>%
mutate(allNA = is.na(Test1) & is.na(Test2) & is.na(Test3)) %>%
select(ID, allNA)
# Gather and keep NAs
tests <- gather(mydata, key=IQSource, value=IQValue, c(Test1, Test2, Test3), na.rm = FALSE)
# Keep the rows that have an IQValue or belong to someone who missed all tests
tests = left_join(tests, missing) %>%
filter(!is.na(IQValue) | allNA)
# Remove duplicated rows of individuals who missed all exams
tests = tests[!is.na(tests$IQValue) | !duplicated(tests[["ID"]]), ]
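If there are more than a handful of test columns, chaining is.na() calls with & gets tedious. A minimal base R sketch of the same allNA flag, assuming the test columns are listed in a character vector (test_cols and all_na are names I made up):

```r
# Same example data as in the question
mydata <- data.frame(ID = 55:64,
                     age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
                     Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
                     Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
                     Test3 = c(NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

# Columns that hold the test scores
test_cols <- c("Test1", "Test2", "Test3")

# TRUE for rows where none of the test columns has a value
all_na <- rowSums(!is.na(mydata[test_cols])) == 0

# IDs of students who missed every test
mydata$ID[all_na]
```

With the example data this flags only ID 62.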
If students can each have only one IQ test...
library(tidyverse)
mydata %>%
gather(key=IQSource, value=IQValue, Test1:Test3) %>%
group_by(ID) %>%
arrange(IQValue) %>%
slice(1)
   ID age IQSource IQValue
1  55  12    Test1     100
2  56  12    Test1      90
3  57  14    Test1      88
4  58  11    Test1     115
5  59  20    Test2     100
6  60  10    Test2     120
7  61  13    Test3     110
8  62  15    Test1      NA
9  63  18    Test3      85
10 64  17    Test3     150
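For the one-test-per-student case, a different technique (not the gather approach above) might be to skip the reshape entirely and use dplyr's coalesce() and case_when(). This is only a sketch under that assumption, and one_test is a name I made up. Unlike the output above, a student with no scores gets NA in IQSource rather than Test1.

```r
library(dplyr)

# Same example data as in the question
mydata <- data.frame(ID = 55:64,
                     age = c(12, 12, 14, 11, 20, 10, 13, 15, 18, 17),
                     Test1 = c(100, 90, 88, 115, NA, NA, NA, NA, NA, NA),
                     Test2 = c(NA, NA, NA, NA, 100, 120, NA, NA, NA, NA),
                     Test3 = c(NA, NA, NA, NA, NA, NA, 110, NA, 85, 150))

one_test <- mydata %>%
  mutate(
    # First non-missing score across the three columns (NA if all are missing)
    IQValue = coalesce(Test1, Test2, Test3),
    # Name of the column that score came from (NA if no test was taken)
    IQSource = case_when(!is.na(Test1) ~ "Test1",
                         !is.na(Test2) ~ "Test2",
                         !is.na(Test3) ~ "Test3")
  ) %>%
  select(ID, age, IQSource, IQValue)
```

If a student had more than one score, this would silently keep only the first, so it only fits the one-test case.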
If students can each have multiple IQ tests...
mydata %>%
# Add an ID with multiple IQ tests
bind_rows(data.frame(ID=65, age=13, Test1=100, Test2=100, Test3=NA)) %>%
gather(key=IQSource, value=IQValue, Test1:Test3) %>%
group_by(ID) %>%
filter(!is.na(IQValue) | all(is.na(IQValue))) %>%
filter(all(!is.na(IQValue)) | !duplicated(IQValue)) %>%
arrange(ID, IQSource)
   ID age IQSource IQValue
1  55  12    Test1     100
2  56  12    Test1      90
3  57  14    Test1      88
4  58  11    Test1     115
5  59  20    Test2     100
6  60  10    Test2     120
7  61  13    Test3     110
8  62  15    Test1      NA
9  63  18    Test3      85
10 64  17    Test3     150
11 65  13    Test1     100
12 65  13    Test2     100