Left join two R data frames with OR conditions
I have two data frames that I want to join using a conditional statement on three non-numeric variables. Here is a pseudo-code version of what I want to achieve.
Join DF1 and DF2 on DF1$A == DF2$A | DF1$A == DF2$B
Here's some code to create the two data frames. variant_index is the data frame that will be used to annotate input using a left_join:
library(dplyr)
options(stringsAsFactors = FALSE)
set.seed(5)
# Five variants, two rows each (same rsid/chrom/ref, possibly different alt)
variant_index <- data.frame(
  rsid = rep(sapply(1:5, function(x) paste0(c("rs", sample(0:9, 8, replace = TRUE)), collapse = "")), each = 2),
  chrom = rep(sample(1:22, 5), each = 2),
  ref = rep(sample(c("A", "T", "C", "G"), 5, replace = TRUE), each = 2),
  alt = sample(c("A", "T", "C", "G"), 10, replace = TRUE),
  eaf = runif(10),
  stringsAsFactors = FALSE
)
variant_index[1, "alt"] <- "T"
variant_index[8, "alt"] <- "A"
input <- variant_index[seq(1, 10, 2), ] %>%
  select(rsid, chrom)
input$assessed <- c("G", "C", "T", "A", "T")
I would like to perform a left_join on input to annotate it with the eaf column from variant_index. As you can see from the input data frame, its assessed column can match either variant_index$ref or variant_index$alt. The rsid and chrom columns will always match.
I know I can specify multiple columns in the by argument of left_join, but if I understand correctly, the condition will always be
input$assessed == variant_index$ref & input$assessed == variant_index$alt
whereas I want to achieve
input$assessed == variant_index$ref | input$assessed == variant_index$alt
The desired output can be obtained like so:
input %>%
  left_join(variant_index, by = c("rsid", "chrom")) %>%
  filter(assessed == ref | assessed == alt)
But this doesn't seem like the best solution to me, since the join may produce up to twice as many rows before the filter is applied, and I would like to apply it to data frames containing 100M+ rows. Is there a better solution?
Complex joins are straightforward in SQL:
library(sqldf)
sqldf("select *
from variant_index v
join input i on i.assessed = v.ref or i.assessed = v.alt")
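For the columns described in the question, the same idea can be sketched with a left join that also matches on rsid and chrom. This is only an illustration: the toy values below are invented, `annotated_sql` is a hypothetical name, and the sketch assumes sqldf's default SQLite backend.

```r
library(sqldf)

# Toy data with the same column names as the question (values are made up)
variant_index <- data.frame(
  rsid  = c("rs1", "rs1", "rs2", "rs2"),
  chrom = c(1, 1, 2, 2),
  ref   = c("A", "A", "C", "C"),
  alt   = c("T", "G", "G", "T"),
  eaf   = c(0.10, 0.20, 0.30, 0.40),
  stringsAsFactors = FALSE
)
input <- data.frame(
  rsid     = c("rs1", "rs2", "rs3"),
  chrom    = c(1, 2, 3),
  assessed = c("G", "C", "A"),
  stringsAsFactors = FALSE
)

# rsid and chrom must match; assessed may match either ref or alt.
# LEFT JOIN keeps input rows without a match (their eaf becomes NA).
annotated_sql <- sqldf("
  select i.rsid, i.chrom, i.assessed, v.ref, v.alt, v.eaf
  from input i
  left join variant_index v
    on  i.rsid  = v.rsid
    and i.chrom = v.chrom
    and (i.assessed = v.ref or i.assessed = v.alt)
")
print(annotated_sql)
```

Unlike joining first and filtering afterwards, the OR condition is evaluated as part of the join itself, so unmatched input rows (rs3 here) survive with missing annotation.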
Try this:
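library(dplyr)
library(dbplyr)
x1 <- memdb_frame(x = 1:5)
x2 <- memdb_frame(x1 = 1:3, x2 = letters[1:3])
# sql_on takes a raw SQL predicate; dbplyr refers to the two tables as LHS and RHS
x1 %>%
  left_join(x2, sql_on = "LHS.x = RHS.x1 OR LHS.x = RHS.x2") %>%
  show_query()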
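If you would rather stay with plain dplyr on local data frames, another option is to avoid the OR altogether: stack ref and alt into one key column so that an ordinary equality join is enough. This is a different technique from the SQL-based answers, shown here as a minimal sketch on invented toy data; `allele_index`, `allele` and `annotated` are hypothetical names, and whether it is faster on 100M+ rows has not been measured.

```r
library(dplyr)

# Toy data with the same column names as the question (values are made up)
variant_index <- data.frame(
  rsid  = c("rs1", "rs1", "rs2", "rs2"),
  chrom = c(1, 1, 2, 2),
  ref   = c("A", "A", "C", "C"),
  alt   = c("T", "G", "G", "T"),
  eaf   = c(0.10, 0.20, 0.30, 0.40),
  stringsAsFactors = FALSE
)
input <- data.frame(
  rsid     = c("rs1", "rs2", "rs3"),
  chrom    = c(1, 2, 3),
  assessed = c("G", "C", "A"),
  stringsAsFactors = FALSE
)

# One copy of the index keyed by ref, one keyed by alt
allele_index <- bind_rows(
  mutate(variant_index, allele = ref),
  mutate(variant_index, allele = alt)
) %>%
  distinct()  # avoids a duplicate row when ref == alt

# "assessed equals ref OR alt" becomes a plain equality on `allele`
annotated <- left_join(
  input, allele_index,
  by = c("rsid", "chrom", "assessed" = "allele")
)
print(annotated)
```

The trade-off is that the stacked index holds up to twice as many rows as variant_index, but the join never materialises the non-matching pairs, and input rows without a match are kept with NA in eaf.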