
For each ID, separate groups into columns and collapse multiple value strings in R

I have a dataframe that looks like this:

in.dat <- data.frame(ID = c("A1", "A1", "A1", "A1", "B1", "B1", "B1", "B1"),
           DB = rep(c("bio", "bio", "func", "loc"), 2),
           val = c("IPR1", "IPR2", "s43", "333-456", 
                   "IPR7", "IPR8", "q87", "566-900"))

  ID   DB     val
1 A1  bio    IPR1
2 A1  bio    IPR2
3 A1 func     s43
4 A1  loc 333-456
5 B1  bio    IPR7
6 B1  bio    IPR8
7 B1 func     q87
8 B1  loc 566-900

I want to turn the values of "DB" into columns and collapse the string values for each ID and DB with ";":

out.dat <- data.frame(ID = c("A1", "B1"),
                  bio = c("IPR1;IPR2", "IPR7;IPR8"),
                  func = c("s43", "q87"),
                  loc = c("333-456", "566-900"))

> out.dat
  ID       bio func     loc
1 A1 IPR1;IPR2  s43 333-456
2 B1 IPR7;IPR8  q87 566-900

I've played around with pivot_wider and grouping in dplyr, but I'm not quite getting what I want, because a group can have multiple values per ID that I want to collapse into one cell (e.g., "IPR1;IPR2").

Any solution would be appreciated!

pivot_wider in recent tidyr versions takes an argument values_fn for a function that aggregates values before reshaping. This lets you do your operation in one function call.

library(tidyr)

in.dat %>%
  pivot_wider(names_from = DB, values_from = val, 
              values_fn = list(val = ~paste(., collapse = ";")))
#> # A tibble: 2 x 4
#>   ID    bio       func  loc    
#>   <fct> <chr>     <chr> <chr>  
#> 1 A1    IPR1;IPR2 s43   333-456
#> 2 B1    IPR7;IPR8 q87   566-900
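
As a hedged variant of the call above: in newer tidyr releases (1.1.0 or later, to my knowledge) values_fn can also be given as a single plain function instead of a named list, and it is then applied to every value column. The name wide is just an illustrative variable, not part of the original answer.

```r
library(tidyr)

# Same example data as in the question.
in.dat <- data.frame(ID = c("A1", "A1", "A1", "A1", "B1", "B1", "B1", "B1"),
                     DB = rep(c("bio", "bio", "func", "loc"), 2),
                     val = c("IPR1", "IPR2", "s43", "333-456",
                             "IPR7", "IPR8", "q87", "566-900"))

# Assumption: tidyr >= 1.1.0, where values_fn may be a single function.
# Every ID/DB cell with several values is pasted into one ";"-separated string.
wide <- pivot_wider(in.dat,
                    names_from = DB,
                    values_from = val,
                    values_fn = function(x) paste(x, collapse = ";"))
wide
```

On older tidyr versions the named-list form shown above is the one to use.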

We can collapse val by ID and DB first and then use pivot_wider.

library(dplyr)

in.dat %>%
  group_by(ID, DB) %>%
  summarise(val = paste0(val, collapse = ";")) %>%
  tidyr::pivot_wider(names_from = DB, values_from = val)

#  ID    bio       func  loc    
#  <fct> <chr>     <chr> <chr>  
#1 A1    IPR1;IPR2 s43   333-456
#2 B1    IPR7;IPR8 q87   566-900

You can use dcast to do this.

in.dat <- data.frame(ID = c("A1", "A1", "A1", "A1", "B1", "B1", "B1", "B1"),
                     DB = rep(c("bio", "bio", "func", "loc"), 2),
                     val = c("IPR1", "IPR2", "s43", "333-456", 
                             "IPR7", "IPR8", "q87", "566-900"))

library(reshape2)
dcast(in.dat, ID ~ DB, paste0, collapse = ";")
#  ID       bio func     loc
#1 A1 IPR1;IPR2  s43 333-456
#2 B1 IPR7;IPR8  q87 566-900

We can also use spread with str_c. Note that spread still works but has been superseded by pivot_wider in tidyr.

library(dplyr)
library(tidyr)
library(stringr)
in.dat %>% 
   group_by(ID, DB) %>% 
   summarise(val = str_c(val, collapse=";")) %>% 
   spread(DB, val)
# A tibble: 2 x 4
# Groups:   ID [2]
#   ID    bio       func  loc    
#   <fct> <chr>     <chr> <chr>  
#1 A1    IPR1;IPR2 s43   333-456
#2 B1    IPR7;IPR8 q87   566-900
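
For comparison, here is a minimal base R sketch of the same collapse-then-widen idea, in case installing packages is not an option. It is not taken from any of the answers above; it swaps in tapply for the grouping step, and the names m and out.base are illustrative only.

```r
# Same example data as in the question.
in.dat <- data.frame(ID = c("A1", "A1", "A1", "A1", "B1", "B1", "B1", "B1"),
                     DB = rep(c("bio", "bio", "func", "loc"), 2),
                     val = c("IPR1", "IPR2", "s43", "333-456",
                             "IPR7", "IPR8", "q87", "566-900"))

# tapply builds an ID x DB matrix; each cell holds the values of that
# ID/DB combination pasted together with ";".
m <- tapply(in.dat$val, list(in.dat$ID, in.dat$DB), paste, collapse = ";")

# Turn the matrix into a data frame with ID as an ordinary column.
out.base <- data.frame(ID = rownames(m), m,
                       row.names = NULL, stringsAsFactors = FALSE)
out.base
```

An ID/DB combination that never occurs in the data ends up as NA in the matrix rather than as an empty string.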
