简体   繁体   中英

Merging two dataframe with dplyr left join?

I have two data frames, competitor_sheet and left_join_prototype , which I'll call df1 and df2, respectively, df1 and df2 each contain information on genes and disease IDs, and look like this:

df1:

HUGO_symbol MeSH_ID
P53 D000310
A1BG D0002277
ZZZ3 D000230

df2:

Gene.Name Parent.MeSH.ID Child.MeSH.ID
P53 D000310 D015675, D006676
HGA2 D031031 D002277
ZZZ3 D001163, D000230 D003451

As you can see, df2 may contain multiple IDs in both the Parent and Child MeSH ID's. Essentially, what I want to do is left join any rows from df2 to df1 if any of the MeSH ID's in either the child or parent correspond to MeSH ID's in the df1, but only if the Gene.Name = HUGO_symbol, the final product should look like this:

HUGO_symbol MeSH_ID Gene.Name Parent.Mesh.ID Child.MeSH.ID
P53 D000310 P53 D000310 D015675, D006676
ZZZ3 D000230 ZZZ3 D001163, D00230 D003451

I've tried something using dplyr that looks something like this:

proto <- left_join(df1,df2, by = c("MeSH_ID"="Parent.MeSH.ID", "HUGO_Symbol"="Gene.Name", "MeSH_ID2"="Child.MeSH.ID")) where proto would be the new table but I have a feeling this is wrong.

As always, any help is hugely appreciated :)

The issue is that a left_join looks for exact matches and there is nothing like "match this or that". Hence, to achieve your desired result you could

  1. unite Parent.MeSH.ID and Child.MeSH.ID into a new column MeSH_ID
  2. split the united columns in separate IDs using eg tidyr::separate_rows . Doing so makes it possible to join the df's by ID.
  3. Use an semi_join to filter out rows in df1 with matches in the newly created df3, finally do a left_join to add the columns from df3. Or if doesn't matter to keep both HUGO_symbol and Gene.Name you could achieve both steps with an inner_join .
df1 <- data.frame(
  stringsAsFactors = FALSE,
       HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
           MeSH_ID = c("D000310", "D0002277", "D000230")
)

df2 <- data.frame(
  stringsAsFactors = FALSE,
         Gene.Name = c("P53", "HGA2", "ZZZ3"),
    Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
     Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451")
)

library(dplyr)
library(tidyr)

df3 <- df2 %>%
  unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>% 
  separate_rows(MeSH_ID, sep = ", ")

semi_join(df1, df3, by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID")) %>% 
  left_join(df3)
#> Joining, by = "MeSH_ID"
#>   HUGO_symbol MeSH_ID Gene.Name   Parent.MeSH.ID    Child.MeSH.ID
#> 1         P53 D000310       P53          D000310 D015675, D006676
#> 2        ZZZ3 D000230      ZZZ3 D001163, D000230          D003451

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM