
Left join two R data frames with OR conditions

Problem

I have two data frames that I want to join using a conditional statement on three non-numeric variables. Here is a pseudo-code version of what I want to achieve.

Join DF1 and DF2 on DF1$A == DF2$A | DF1$A == DF2$B

Dataset

Here's some code to create the two data frames. variant_index is the data frame that will be used to annotate input using a left_join:

library(dplyr)
options(stringsAsFactors = FALSE)

set.seed(5)
variant_index <- data.frame(
  rsid   = rep(sapply(1:5, function(x) paste0(c("rs", sample(0:9, 8, replace = TRUE)), collapse = "")), each = 2),
  chrom  = rep(sample(1:22, 5), each = 2),
  ref    = rep(sample(c("A", "T", "C", "G"), 5, replace = TRUE), each = 2),
  alt    = sample(c("A", "T", "C", "G"), 10, replace = TRUE),
  eaf    = runif(10),
  stringsAsFactors = FALSE
)
variant_index[1, "alt"] <- "T"
variant_index[8, "alt"] <- "A"

input <- variant_index[seq(1, 10, 2), ] %>%
  select(rsid, chrom)
input$assessed <- c("G", "C", "T", "A", "T")

What I tried

I would like to perform a left_join on input to annotate it with the eaf column from variant_index. As you can see from the input data frame, its assessed column can match either variant_index$ref or variant_index$alt. The rsid and chrom columns will always match.

I know I can specify multiple columns in the by argument of left_join, but if I understand correctly, the condition will always be

input$assessed == variant_index$ref & input$assessed == variant_index$alt

whereas I want to achieve

input$assessed == variant_index$ref | input$assessed == variant_index$alt

Possible solution

The desired output can be obtained like so:

input %>%
  left_join(variant_index, by = c("rsid", "chrom")) %>%
  filter(assessed == ref | assessed == alt)

But this doesn't seem like the best solution to me, since the join may generate up to double the number of rows before filtering, and I would like to apply it to data frames containing 100M+ rows. Is there a better solution?
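One way to sidestep the OR condition entirely is to turn it into a plain equi-join: stack the ref and alt alleles of variant_index into a single key column, then join on rsid, chrom and that key. The sketch below is only an illustration under assumptions: the two small data frames are hand-made stand-ins for the question's data (the values are invented), and index_long and annotated are hypothetical names. It duplicates variant_index instead of the join result, so whether it is actually cheaper at 100M+ rows would need benchmarking.

```r
library(dplyr)

# Hand-made stand-ins for the question's data frames (illustrative values only)
variant_index <- data.frame(
  rsid  = c("rs1", "rs1", "rs2", "rs2"),
  chrom = c(1L, 1L, 2L, 2L),
  ref   = c("A", "A", "C", "C"),
  alt   = c("T", "G", "A", "G"),
  eaf   = c(0.10, 0.20, 0.30, 0.40),
  stringsAsFactors = FALSE
)
input <- data.frame(
  rsid     = c("rs1", "rs2", "rs3"),
  chrom    = c(1L, 2L, 3L),
  assessed = c("G", "C", "T"),
  stringsAsFactors = FALSE
)

# Stack ref and alt into one key column named like the column in `input`.
# distinct() guards against a row appearing twice when ref == alt.
index_long <- bind_rows(
  transmute(variant_index, rsid, chrom, assessed = ref, ref, alt, eaf),
  transmute(variant_index, rsid, chrom, assessed = alt, ref, alt, eaf)
) %>%
  distinct()

# "assessed == ref | assessed == alt" is now an ordinary equality join
annotated <- left_join(input, index_long, by = c("rsid", "chrom", "assessed"))
annotated
```

Unlike filtering after the join, this keeps input rows that have no match in variant_index (here rs3), with NA in the annotation columns, which is the usual left-join behaviour.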

Complex joins are straightforward in SQL:

library(sqldf)

sqldf("select *
  from variant_index v
  join input i on i.assessed = v.ref or i.assessed = v.alt")
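The query above is an inner join on the allele columns only. To get closer to what the question describes (a left join from input that also matches on rsid and chrom), the same idea could be extended roughly as follows. This is a sketch, not a tested solution for the real data: the two data frames are small hand-made stand-ins with invented values, and res is a hypothetical name.

```r
library(sqldf)

# Hand-made stand-ins for the question's data frames (illustrative values only)
variant_index <- data.frame(
  rsid  = c("rs1", "rs1", "rs2", "rs2"),
  chrom = c(1L, 1L, 2L, 2L),
  ref   = c("A", "A", "C", "C"),
  alt   = c("T", "G", "A", "G"),
  eaf   = c(0.10, 0.20, 0.30, 0.40),
  stringsAsFactors = FALSE
)
input <- data.frame(
  rsid     = c("rs1", "rs2", "rs3"),
  chrom    = c(1L, 2L, 3L),
  assessed = c("G", "C", "T"),
  stringsAsFactors = FALSE
)

# Left join from input: rsid and chrom must match, and the assessed allele
# must equal either ref or alt. Unmatched input rows are kept with NA values.
res <- sqldf("
  select i.*, v.ref, v.alt, v.eaf
  from input i
  left join variant_index v
    on  i.rsid  = v.rsid
    and i.chrom = v.chrom
    and (i.assessed = v.ref or i.assessed = v.alt)
")
res
```

The parentheses around the OR matter: without them, AND binds more tightly than OR and the rsid and chrom constraints would only apply to the ref comparison.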

Try this

library(dplyr)
library(dbplyr)

x1 <- memdb_frame(x = 1:5)
x2 <- memdb_frame(x1 = 1:3, x2 = letters[1:3])

# sql_on takes a raw SQL predicate; dbplyr aliases the two tables as LHS and RHS
x1 %>%
  left_join(x2, sql_on = "LHS.x = RHS.x1 OR LHS.x = RHS.x2") %>%
  show_query()
