
Join dataframes using an OR condition for columns to match by

Suppose we have two dataframes we want to join. For the sake of simplicity, suppose both have the same number of rows, and that each row in one dataframe has exactly one corresponding row in the other dataframe. Here is a minimal working example (MWE) of this setup:

library(dplyr)
df1 <- tibble(id1 = 1:3, x = c("b", "a", "b"), y = c(55, 50, 58), z = c(65, 60, 69))
df2 <- tibble(id2 = 11:13, x = c("a", "b", "b"), y = c(50, 55, 59), z = c(61, 66, 69))

# # A tibble: 3 x 4
#     id1 x         y     z
#   <int> <chr> <dbl> <dbl>
# 1     1 b        55    65
# 2     2 a        50    60
# 3     3 b        58    69
# # A tibble: 3 x 4
#     id2 x         y     z
#   <int> <chr> <dbl> <dbl>
# 1    11 a        50    61
# 2    12 b        55    66
# 3    13 b        59    69

The goal is to link the unique IDs in df1 with the unique but different IDs in df2. I can do so by exploiting the fact that the combined information in columns x, y, and z is sufficient to uniquely identify the rows. The problem, explained in detail below, is that the join would only work correctly in this setup if we were able to join by something like c("x", "y" OR "z").

  • Starting from column x, we have enough information to link id1 == 2 with id2 == 11, because they share the same unique value (i.e. "a"). However, column x alone is not enough for the remaining pairs.
  • Using column y in the join allows us to link id1 == 1 with id2 == 12. However, column y (in addition to x) will (correctly) not match id1 == 3 with id2 == 13, because they have different y values.
  • Using column z in the join allows us to link id1 == 3 with id2 == 13, but now we have the same problem as before, this time for linking id1 == 1 with id2 == 12: they differ in their z values. Moreover, if we join by z then we destroy the link of id1 == 2 with id2 == 11, whose z values also differ (although the example would still work without this extra quirk).
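The three cases above can be checked directly. The following is only a sketch that reuses df1 and df2 from the MWE; the result names by_xy, by_xz and by_xyz are made up for illustration.

```r
library(dplyr)

df1 <- tibble(id1 = 1:3, x = c("b", "a", "b"), y = c(55, 50, 58), z = c(65, 60, 69))
df2 <- tibble(id2 = 11:13, x = c("a", "b", "b"), y = c(50, 55, 59), z = c(61, 66, 69))

# Join on x and y: finds only the pairs whose y values agree.
by_xy <- inner_join(df1, df2, by = c("x", "y"), suffix = c("1", "2")) %>%
  select(id1, id2)

# Join on x and z: finds only the pairs whose z values agree.
by_xz <- inner_join(df1, df2, by = c("x", "z"), suffix = c("1", "2")) %>%
  select(id1, id2)

# Join on all three columns: this is an AND condition, not an OR.
by_xyz <- inner_join(df1, df2, by = c("x", "y", "z")) %>%
  select(id1, id2)
```

With this data, by_xy contains the pairs for id1 == 1 and id1 == 2, by_xz contains only the pair for id1 == 3, and by_xyz is empty, so no single choice of columns recovers all three links.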

Therefore, the question is whether there is a good and/or succinct way of joining these two dataframes using an OR condition on columns y and z, in addition to matching strictly on x. Something like

full_join(df1, df2, by = c("x", "y" OR "z"))
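The call above is pseudo-syntax and will not parse in R. One possible reading of the intended semantics, offered only as a sketch and not as a built-in dplyr feature, is to join on x alone and then keep the pairs that agree on y or on z. The names candidates and linked are illustrative.

```r
library(dplyr)

df1 <- tibble(id1 = 1:3, x = c("b", "a", "b"), y = c(55, 50, 58), z = c(65, 60, 69))
df2 <- tibble(id2 = 11:13, x = c("a", "b", "b"), y = c(50, 55, 59), z = c(61, 66, 69))

# Pair up every row of df1 with every row of df2 that shares the same x.
# Recent dplyr versions may warn about a many-to-many relationship here,
# because x == "b" occurs twice on each side.
candidates <- inner_join(df1, df2, by = "x", suffix = c("1", "2"))

# Keep only the candidate pairs that agree on y OR on z.
linked <- candidates %>%
  filter(y1 == y2 | z1 == z2) %>%
  select(id1, id2)
```

The intermediate table grows with the number of rows that share an x value, so this reading is likely only practical when x already narrows the candidates considerably.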

My attempt so far involves a series of joins and other manipulations, but it is very cumbersome to read (which makes it error-prone), and likely too slow for large enough data. I'm open to corrections to this method, or to being told that this is the only way to do this (although I really hope it's not).

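# First join matches on x and y; the clashing z columns become z1 and z2.
left_join(df1, df2, by = c("x", "y"), suffix = c("1", "2")) %>% 
  # Second join matches on x and z; the clashing id2 columns become id21 and id22.
  left_join(df2, by = c("x", "z1" = "z"), suffix = c("1", "2")) %>% 
  rowwise() %>% 
  # Take whichever ID was found (this assumes at most one of the two is non-missing).
  mutate(id2 = sum(id21, id22, na.rm = TRUE)) %>% 
  select(id1, id2)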

# # A tibble: 3 x 2
# # Rowwise: 
#     id1   id2
#   <int> <int>
# 1     1    12
# 2     2    11
# 3     3    13
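One caveat with the attempt above is that sum() would add the two IDs together if a row happened to match on both y and z. A possible variant, sketched here with the hypothetical helper column names id2_y and id2_z, uses coalesce() to take the first non-missing ID, which also removes the need for rowwise().

```r
library(dplyr)

df1 <- tibble(id1 = 1:3, x = c("b", "a", "b"), y = c(55, 50, 58), z = c(65, 60, 69))
df2 <- tibble(id2 = 11:13, x = c("a", "b", "b"), y = c(50, 55, 59), z = c(61, 66, 69))

linked <- df1 %>%
  # Look up the df2 ID that matches on x and y (NA if there is none).
  left_join(select(df2, id2_y = id2, x, y), by = c("x", "y")) %>%
  # Look up the df2 ID that matches on x and z (NA if there is none).
  left_join(select(df2, id2_z = id2, x, z), by = c("x", "z")) %>%
  # Prefer the y-based match and fall back to the z-based one.
  mutate(id2 = coalesce(id2_y, id2_z)) %>%
  select(id1, id2)
```

This still performs two joins, so it addresses readability and the double-match case rather than speed.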

You could use tidyr to bring both data.frames into a long format.

library(tidyr)
library(dplyr)

df2_2 <- df2 %>% 
  pivot_longer(c(y, z))

df1_1 <- df1 %>% 
  pivot_longer(c(y, z))

Next, inner_join both of your pivoted data.frames by x and the new value column, and bring them back into a wide format. So

df1_1  %>% 
  inner_join(df2_2, by=c("x", "value")) %>% 
  pivot_wider(names_from="name.x") %>% 
  select(-name.y)

returns

# A tibble: 3 x 5
    id1 x       id2     y     z
  <int> <chr> <int> <dbl> <dbl>
1     1 b        12    55    NA
2     2 a        11    50    NA
3     3 b        13    NA    69

or

df1_1  %>% 
  inner_join(df2_2, by=c("x", "value")) %>% 
  pivot_wider(names_from="name.x") %>% 
  select(id1, id2)

gives you your desired output

# A tibble: 3 x 2
    id1   id2
  <int> <int>
1     1    12
2     2    11
3     3    13
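Note that joining on x and value alone would also match a y value in df1 against an equal z value in df2. If that is not wanted, a possible variant (a sketch, not part of the original answer) adds the name column to the join keys, so y is only compared with y and z with z. The names df1_long, df2_long and linked are illustrative.

```r
library(dplyr)
library(tidyr)

df1 <- tibble(id1 = 1:3, x = c("b", "a", "b"), y = c(55, 50, 58), z = c(65, 60, 69))
df2 <- tibble(id2 = 11:13, x = c("a", "b", "b"), y = c(50, 55, 59), z = c(61, 66, 69))

# One row per (id, column name, value) combination.
df1_long <- pivot_longer(df1, c(y, z))
df2_long <- pivot_longer(df2, c(y, z))

linked <- df1_long %>%
  # Including "name" in the keys compares y only with y and z only with z.
  inner_join(df2_long, by = c("x", "name", "value")) %>%
  # A pair that matches on both y and z would appear twice; keep it once.
  distinct(id1, id2)
```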

This works fine if y and z are of the same type. If they are of different types (for example, one is a character and the other one a double), you have to convert them into a common format so that pivot_longer works.
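As a sketch of that conversion, suppose z were stored as character. The data frames df1_chr and df2_chr below are hypothetical variants of the example data, and converting both columns to character is just one possible choice of common format.

```r
library(dplyr)
library(tidyr)

# Hypothetical variant of the example data in which z is a character column.
df1_chr <- tibble(id1 = 1:3, x = c("b", "a", "b"),
                  y = c(55, 50, 58), z = c("65", "60", "69"))
df2_chr <- tibble(id2 = 11:13, x = c("a", "b", "b"),
                  y = c(50, 55, 59), z = c("61", "66", "69"))

# Convert y and z to character so they can share a single value column.
df1_long <- df1_chr %>%
  mutate(across(c(y, z), as.character)) %>%
  pivot_longer(c(y, z))

df2_long <- df2_chr %>%
  mutate(across(c(y, z), as.character)) %>%
  pivot_longer(c(y, z))

linked <- inner_join(df1_long, df2_long, by = c("x", "value")) %>%
  distinct(id1, id2)
```

Comparing numbers as strings assumes that equal values are printed identically on both sides, which may not hold for doubles that carry floating-point noise.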
