Add column to dataframe according to values in multiple dataframes in dplyr

Question

I have a dataframe target that contains columns SNP and value :

target <- data.frame("SNP" = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     "value" = 1:6)

I have 3 other dataframes that contains columns SNP and int as a list:

ref1 <- data.frame("SNP" = c("rs1", "rs2", "rs8"), "int" = c(5, 7, 88))
ref2 <- data.frame("SNP" = c("rs9", "rs4", "rs3"), "int" = c(23, 4, 43))
ref3 <- data.frame("SNP" = c("rs10", "rs6", "rs5"), "int" = c(53, 22, 76))
mylist <- list(ref1, ref2, ref3)

I want to add a new column int for target whose values correspond to the int values of the ref1/2/3 with the same SNP . For example, the first int value for target should be 7 because row 2 of ref1 has SNP of rs2 and int of 7.

I tried the following code:

for (i in 1:3) {
    target <- target %>%
                left_join(mylist[[i]], by = "SNP")
}

The matching was quick and successful. However, I was returned 3 new columns rather than 1, shown as below:

I then used the following code:

target[, "ref"] <- NA
for (i in 1:3) {
    common <- Reduce(intersect, list(target$SNP, mylist[[i]]$SNP))

    tar.pos <- match(common, target$SNP)
    ref.pos <- match(common, mylist[[i]]$SNP)

    target$ref[tar.pos] <- mylist[[i]]$int[ref.pos]
}

In my real data, I have 22 ref dataframes with each have 1-6 MILLION rows. I would prefer to do the matching and joining ref by ref rather than merging all refs into one single large data. When I tried the second method above on my real data, I noted that the match function worked very slow. This is why I prefer some smart way of doing the work. I found left_join worked very fast even for my large data. Unfortunately, the output is not exactly what I want.

I wish to do the above work in a fast way, preferably in tidyverse. Is there any suggestion on how do revise the first coding method or any other smarter way?

Answer 1

If binding all data in mylist and merging to target takes too much memory, you can use purrr::reduce to merge one by one.

library(tidyverse)

reduce(mylist,
       ~ left_join(.x, .y, by = "SNP") %>%
         mutate(int = coalesce(int.x, int.y)) %>%
         select(-c(int.x, int.y)),
       .init = mutate(target, int = NA_real_))

#    SNP value int
# 1  rs2     1   7
# 2  rs4     2   4
# 3  rs6     3  22
# 4 rs19     4  NA
# 5  rs8     5  88
# 6  rs9     6  23

Answer 2

With tidyverse , we can also do

library(dplyr)
bind_rows(mylist) %>%
  right_join(target, by = "SNP")

Answer 3

You could convert mylist into one dataframe and then merge with target

merge(target, do.call(rbind, mylist), by = "SNP", all.x = TRUE)

#   SNP value int
#1 rs19     4  NA
#2  rs2     1   7
#3  rs4     2   4
#4  rs6     3  22
#5  rs8     5  88
#6  rs9     6  23

Or using dplyr

library(dplyr)
left_join(target, bind_rows(mylist), by = "SNP")

Or in data.table

library(data.table)
rbindlist(mylist)[target, on = 'SNP']

Add column to dataframe according to values in multiple dataframes in dplyr

Question

3 answers

solution1
1 ACCPTED 2020-02-29 15:32:45

solution2
1 2020-02-29 17:52:02

solution3
0 2020-02-29 13:35:18

Add column to dataframe according to values in multiple dataframes in dplyr

Question

3 answers

solution1 1 ACCPTED 2020-02-29 15:32:45

solution2 1 2020-02-29 17:52:02

solution3 0 2020-02-29 13:35:18

solution1
1 ACCPTED 2020-02-29 15:32:45

solution2
1 2020-02-29 17:52:02

solution3
0 2020-02-29 13:35:18