简体   繁体   中英

Add column to dataframe according to values in multiple dataframes in dplyr

I have a dataframe target that contains columns SNP and value :

target <- data.frame("SNP" = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     "value" = 1:6)

I have 3 other dataframes that contains columns SNP and int as a list:

ref1 <- data.frame("SNP" = c("rs1", "rs2", "rs8"), "int" = c(5, 7, 88))
ref2 <- data.frame("SNP" = c("rs9", "rs4", "rs3"), "int" = c(23, 4, 43))
ref3 <- data.frame("SNP" = c("rs10", "rs6", "rs5"), "int" = c(53, 22, 76))
mylist <- list(ref1, ref2, ref3)

I want to add a new column int for target whose values correspond to the int values of the ref1/2/3 with the same SNP . For example, the first int value for target should be 7 because row 2 of ref1 has SNP of rs2 and int of 7.

I tried the following code:

for (i in 1:3) {
    target <- target %>%
                left_join(mylist[[i]], by = "SNP")
}

The matching was quick and successful. However, I was returned 3 new columns rather than 1, shown as below: 在此处输入图片说明

I then used the following code:

target[, "ref"] <- NA
for (i in 1:3) {
    common <- Reduce(intersect, list(target$SNP, mylist[[i]]$SNP))

    tar.pos <- match(common, target$SNP)
    ref.pos <- match(common, mylist[[i]]$SNP)

    target$ref[tar.pos] <- mylist[[i]]$int[ref.pos]
}

In my real data, I have 22 ref dataframes with each have 1-6 MILLION rows. I would prefer to do the matching and joining ref by ref rather than merging all refs into one single large data. When I tried the second method above on my real data, I noted that the match function worked very slow. This is why I prefer some smart way of doing the work. I found left_join worked very fast even for my large data. Unfortunately, the output is not exactly what I want.

I wish to do the above work in a fast way, preferably in tidyverse. Is there any suggestion on how do revise the first coding method or any other smarter way?

If binding all data in mylist and merging to target takes too much memory, you can use purrr::reduce to merge one by one.

library(tidyverse)

reduce(mylist,
       ~ left_join(.x, .y, by = "SNP") %>%
         mutate(int = coalesce(int.x, int.y)) %>%
         select(-c(int.x, int.y)),
       .init = mutate(target, int = NA_real_))

#    SNP value int
# 1  rs2     1   7
# 2  rs4     2   4
# 3  rs6     3  22
# 4 rs19     4  NA
# 5  rs8     5  88
# 6  rs9     6  23

With tidyverse , we can also do

library(dplyr)
bind_rows(mylist) %>%
  right_join(target, by = "SNP")

You could convert mylist into one dataframe and then merge with target

merge(target, do.call(rbind, mylist), by = "SNP", all.x = TRUE)

#   SNP value int
#1 rs19     4  NA
#2  rs2     1   7
#3  rs4     2   4
#4  rs6     3  22
#5  rs8     5  88
#6  rs9     6  23

Or using dplyr

library(dplyr)
left_join(target, bind_rows(mylist), by = "SNP")

Or in data.table

library(data.table)
rbindlist(mylist)[target, on = 'SNP']

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM