I have a dataframe target
that contains columns SNP
and value
:
target <- data.frame("SNP" = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
"value" = 1:6)
I have 3 other dataframes that contains columns SNP
and int
as a list:
ref1 <- data.frame("SNP" = c("rs1", "rs2", "rs8"), "int" = c(5, 7, 88))
ref2 <- data.frame("SNP" = c("rs9", "rs4", "rs3"), "int" = c(23, 4, 43))
ref3 <- data.frame("SNP" = c("rs10", "rs6", "rs5"), "int" = c(53, 22, 76))
mylist <- list(ref1, ref2, ref3)
I want to add a new column int
for target
whose values correspond to the int
values of the ref1/2/3 with the same SNP
. For example, the first int
value for target
should be 7 because row 2 of ref1 has SNP
of rs2 and int
of 7.
I tried the following code:
for (i in 1:3) {
target <- target %>%
left_join(mylist[[i]], by = "SNP")
}
The matching was quick and successful. However, I was returned 3 new columns rather than 1, shown as below:
I then used the following code:
target[, "ref"] <- NA
for (i in 1:3) {
common <- Reduce(intersect, list(target$SNP, mylist[[i]]$SNP))
tar.pos <- match(common, target$SNP)
ref.pos <- match(common, mylist[[i]]$SNP)
target$ref[tar.pos] <- mylist[[i]]$int[ref.pos]
}
In my real data, I have 22 ref dataframes with each have 1-6 MILLION rows. I would prefer to do the matching and joining ref by ref rather than merging all refs into one single large data. When I tried the second method above on my real data, I noted that the match
function worked very slow. This is why I prefer some smart way of doing the work. I found left_join worked very fast even for my large data. Unfortunately, the output is not exactly what I want.
I wish to do the above work in a fast way, preferably in tidyverse. Is there any suggestion on how do revise the first coding method or any other smarter way?
If binding all data in mylist
and merging to target
takes too much memory, you can use purrr::reduce
to merge one by one.
library(tidyverse)
reduce(mylist,
~ left_join(.x, .y, by = "SNP") %>%
mutate(int = coalesce(int.x, int.y)) %>%
select(-c(int.x, int.y)),
.init = mutate(target, int = NA_real_))
# SNP value int
# 1 rs2 1 7
# 2 rs4 2 4
# 3 rs6 3 22
# 4 rs19 4 NA
# 5 rs8 5 88
# 6 rs9 6 23
With tidyverse
, we can also do
library(dplyr)
bind_rows(mylist) %>%
right_join(target, by = "SNP")
You could convert mylist
into one dataframe and then merge
with target
merge(target, do.call(rbind, mylist), by = "SNP", all.x = TRUE)
# SNP value int
#1 rs19 4 NA
#2 rs2 1 7
#3 rs4 2 4
#4 rs6 3 22
#5 rs8 5 88
#6 rs9 6 23
Or using dplyr
library(dplyr)
left_join(target, bind_rows(mylist), by = "SNP")
Or in data.table
library(data.table)
rbindlist(mylist)[target, on = 'SNP']
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.