
Joining two datasets on comma-delimited column

I have a large dataset that I'm coding with both human-readable and machine-readable identifiers. I'd like to type in only the human-readable codes, and use a merge in R to add the machine-readable ones. The only hitch is that I'm entering multiple identifiers in the same column, separated by commas. It looks a bit like this:

df <- as.data.frame(cbind(identifier=c("a","a, b","b","b, c","c"), data=c(1,2,3,4,5)))

codebook <- as.data.frame(cbind(id=c("a","b", "c","d"),code=c('9999','8888','7777','6666')))

What I'd like to get in the end would look like this:

 answer <- as.data.frame(cbind(identifier=c("a","a, b","b","b, c","c"), code=c('9999', '9999, 8888', '8888', '8888, 7777', '7777'), data=c(1,2,3,4,5)))

I've experimented with separate() and unite() from tidyr, but I'm wondering if there's a simpler way.

This doesn't give you your exact output, but it may be more desirable because it is easy to work with (it is more "tidy", to use Wickham's term):

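library(dplyr)
library(reshape2)  # provides melt()

df %>%
  # split on the comma; gsub() leaves rows without a comma unchanged
  mutate(new_1 = gsub("(.*),\\s*(.*)", "\\1", identifier),
         new_2 = gsub("(.*),\\s*(.*)", "\\2", identifier)) %>%
  # single-identifier rows end up with new_1 == new_2, so blank out the copy
  mutate(new_2 = ifelse(new_1 == new_2, NA, new_2)) %>%
  select(data, new_1, new_2) %>%
  melt("data") %>%
  inner_join(codebook, by = c("value" = "id"))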

#   data variable value code
# 1    1    new_1     a 9999
# 2    2    new_1     a 9999
# 3    3    new_1     b 8888
# 4    4    new_1     b 8888
# 5    5    new_1     c 7777
# 6    2    new_2     b 8888
# 7    4    new_2     c 7777

Try separate_rows. First convert the factor columns to character. Then use separate_rows to unnest df into one row per identifier, left join it to codebook, and collapse the rows back per data value with toString. Note that the result has character columns.

library(dplyr)
library(tidyr)

df %>%
   mutate_all(as.character) %>%
   separate_rows(identifier) %>% 
   left_join(codebook %>% mutate_all(as.character), by = c("identifier" = "id")) %>% 
   group_by(data) %>% 
   summarize(identifier = toString(identifier), code = toString(code)) %>%
   ungroup

giving:

# A tibble: 5 x 3
  data  identifier code      
  <chr> <chr>      <chr>     
1 1     a          9999      
2 2     a, b       9999, 8888
3 3     b          8888      
4 4     b, c       8888, 7777
5 5     c          7777      
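
If you would rather avoid the extra packages, the same split, look up, and collapse idea can be sketched in base R. This is only a rough sketch under a couple of assumptions: the sample data is rebuilt with data.frame() instead of as.data.frame(cbind(...)) so that the columns are plain character and numeric, and lookup is a helper name introduced here, not something from the question.

```r
# Sample data from the question, rebuilt with data.frame() (an assumption
# made here so that strings stay character and `data` stays numeric).
df <- data.frame(identifier = c("a", "a, b", "b", "b, c", "c"),
                 data = c(1, 2, 3, 4, 5),
                 stringsAsFactors = FALSE)
codebook <- data.frame(id = c("a", "b", "c", "d"),
                       code = c("9999", "8888", "7777", "6666"),
                       stringsAsFactors = FALSE)

# Named lookup vector (hypothetical helper): names are the human-readable ids,
# values are the machine-readable codes.
lookup <- setNames(codebook$code, codebook$id)

# Split each cell on a comma plus optional whitespace, look up every piece,
# and paste the codes back together in the original order.
df$code <- vapply(strsplit(df$identifier, ",\\s*"),
                  function(ids) toString(lookup[ids]),
                  character(1))

df[, c("identifier", "code", "data")]
```

One thing to watch: an identifier that is missing from codebook comes back as NA from the lookup and is pasted into the string as the text "NA", so it may be worth checking for those before relying on the result.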
