
Reshape a dataframe with missing values in tidyr

I have a dataframe like:

library(tidyverse) 

df_mess <- tibble::tribble(
  ~id, ~value, ~answer_text,
  123,     25,        "age",
  123,     NA,     "female",
  234,     29,        "age",
  234,     NA,       "male",
  345,     14,        "age",
  345,     NA,     "female"
  )

I would like to reshape it into "tidy" data, i.e. one row per observation:

df <- tibble::tribble(
  ~id, ~age,     ~sex,
  123,   25, "female",
  234,   29,   "male",
  345,   14, "female"
  )

I tried a version of gather/spread, but had no luck.

Any lead is appreciated.

If the structure of your data is always the same, I would do something like:

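# Copy the answer from the following row onto each row
df_mess$new <- dplyr::lead(df_mess$answer_text)
# Keep only the rows that carry a value (the "age" rows)
df_mess <- subset(df_mess, !is.na(value))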

but this only works in this particular case, where every value row is immediately followed by its matching answer row.
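The same idea can be sketched as a single dplyr pipeline that ends in the requested shape. This is only a sketch under the same assumption (each "age" row is directly followed by the sex row for the same id); the name df_tidy and the renaming of value to age are illustrative and not part of the original answer.

```r
library(dplyr)

df_mess <- tibble::tribble(
  ~id, ~value, ~answer_text,
  123,     25,        "age",
  123,     NA,     "female",
  234,     29,        "age",
  234,     NA,       "male",
  345,     14,        "age",
  345,     NA,     "female"
)

df_tidy <- df_mess %>%
  group_by(id) %>%
  # Within each id, take the answer from the row that follows
  mutate(sex = lead(answer_text)) %>%
  ungroup() %>%
  # Keep only the rows that hold the age value
  filter(answer_text == "age") %>%
  select(id, age = value, sex)

df_tidy
```

Grouping by id keeps lead() from pulling an answer across two different ids, but the row order within each id still has to be age first, then sex.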

Here is a solution with spread and gather. spread turns every label in answer_text into its own column, which is already the right result for variables like age, where answer_text holds the name of the variable. Where answer_text holds the value of a variable instead (female and male for sex in this case), you need to gather those columns back, as shown below.

To get the sex column to work, I changed the NAs in value to -99; any value that cannot occur in the real data would do. If you spread without something in the value column, the female and male columns created by spread contain only NA, so gather with na.rm = TRUE cannot tell which of the two applies to each id.

df_mess[is.na(df_mess)] <- -99

df_mess %>% 
  spread(answer_text, value) %>% 
  gather(sex, temp, female, male, na.rm = TRUE) %>% 
  select(-temp)

output

# A tibble: 3 x 3
     id   age sex   
  <dbl> <dbl> <chr> 
1   123    25 female
2   345    14 female
3   234    29 male 
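To see why the placeholder matters, here is a small sketch that runs the same spread and gather steps on the original data without replacing the NAs first. The name no_sentinel is only illustrative.

```r
library(dplyr)
library(tidyr)

df_mess <- tibble::tribble(
  ~id, ~value, ~answer_text,
  123,     25,        "age",
  123,     NA,     "female",
  234,     29,        "age",
  234,     NA,       "male",
  345,     14,        "age",
  345,     NA,     "female"
)

# No -99 replacement here: female and male are all NA after spread,
# so na.rm = TRUE removes every row in the gather step
no_sentinel <- df_mess %>%
  spread(answer_text, value) %>%
  gather(sex, temp, female, male, na.rm = TRUE)

nrow(no_sentinel)  # 0
```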

Here is an example with more variables and a legitimate NA in the size variable for id 123:

df_mess <- tibble::tribble(
  ~id, ~value, ~answer_text,
  123,     25,        "age",
  123,     NA,     "female",
  234,     29,        "age",
  234,     NA,       "male",
  345,     14,        "age",
  345,     NA,     "female",
  123,     NA,      "brown",
  234,     NA,     "blonde",
  345,     NA,      "black",
  123,     NA,       "size",
  234,     30,       "size",
  345,     40,       "size"
)
df_mess[is.na(df_mess)] <- -99
df_clean <- df_mess %>% 
  spread(answer_text, value) %>% 
  gather(sex, temp, female, male, na.rm = TRUE) %>% 
  select(-temp) %>% 
  gather(hair, temp, black:brown, na.rm = TRUE) %>% 
  select(-temp)

df_clean[df_clean == -99] <- NA
df_clean

output

     id   age  size sex    hair  
  <dbl> <dbl> <dbl> <chr>  <chr> 
1   345    14    40 female black 
2   234    29    30 male   blonde
3   123    25    NA female brown
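As a possible alternative that avoids the -99 placeholder, the rows can be relabelled first and then widened with pivot_wider, which replaces spread in tidyr 1.0.0 and later. This is a different technique from the one above, shown on the original six-row data, and it assumes you know in advance which labels are answers to the sex question; sex_levels, key, val and df_tidy are names made up for this sketch.

```r
library(dplyr)
library(tidyr)

df_mess <- tibble::tribble(
  ~id, ~value, ~answer_text,
  123,     25,        "age",
  123,     NA,     "female",
  234,     29,        "age",
  234,     NA,       "male",
  345,     14,        "age",
  345,     NA,     "female"
)

# Assumption: these are the only labels that are answers rather than variable names
sex_levels <- c("female", "male")

df_tidy <- df_mess %>%
  mutate(
    is_sex = answer_text %in% sex_levels,
    # Variable name: "sex" for answer rows, otherwise the label itself
    key = if_else(is_sex, "sex", answer_text),
    # Variable value, as character so both kinds fit in one column
    val = if_else(is_sex, answer_text, as.character(value))
  ) %>%
  select(id, key, val) %>%
  pivot_wider(names_from = key, values_from = val) %>%
  # Convert age back to a number after widening
  mutate(age = as.numeric(age))

df_tidy
```

Because the rows are classified by label rather than by whether value is NA, a genuine NA in a numeric variable would stay NA without any placeholder. Each further answer-type variable (such as hair) would need its own set of levels.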
