Join dataframes using an OR condition for columns to match by
Suppose we have two dataframes we want to join. For the sake of simplicity, suppose both have the same number of rows, and that each row in one dataframe has exactly one corresponding row in the other dataframe. Here is an MWE of this setup:
library(dplyr)
df1 <- tibble(id1 = 1:3, x = c("b", "a", "b"), y = c(55, 50, 58), z = c(65, 60, 69))
df2 <- tibble(id2 = 11:13, x = c("a", "b", "b"), y = c(50, 55, 59), z = c(61, 66, 69))
# # A tibble: 3 x 4
#     id1 x         y     z
#   <int> <chr> <dbl> <dbl>
# 1     1 b        55    65
# 2     2 a        50    60
# 3     3 b        58    69
# # A tibble: 3 x 4
#     id2 x         y     z
#   <int> <chr> <dbl> <dbl>
# 1    11 a        50    61
# 2    12 b        55    66
# 3    13 b        59    69
The goal is to link the unique IDs in df1 with the unique but different IDs in df2. I can do so by exploiting the fact that the combined information in columns x, y, and z is sufficient to uniquely identify the rows. The problem, explained in detail below, is that the join would work correctly in this setup only if we were able to join by something like c("x", "y" OR "z").
Starting with column x, we have enough information to link id1 == 2 with id2 == 11, because they share the same unique value (i.e. "a"). However, column x alone is not enough for the remaining pairs.
Including column y in the join allows us to link id1 == 1 with id2 == 12. However, column y (in addition to x) will (correctly) not match id1 == 3 with id2 == 13, because they have different y values.
Including column z in the join allows me to link id1 == 3 with id2 == 13, but now we have the same problem as before, this time for linking id1 == 1 with id2 == 12: they differ in their z values. Moreover, if we join by z then we destroy the link of id1 == 2 with id2 == 11 (although the example would still work without this extra quirk).
Therefore, the question is whether there is a good and/or succinct way of joining these two dataframes using an OR condition on columns y and z, in addition to matching strictly on x. Something like
full_join(df1, df2, by = c("x", "y" OR "z"))
My attempt so far involves a series of joins and other manipulations, but it is very cumbersome to read (which makes it prone to errors), and likely too slow for large enough data. I'm open to corrections to this method, or to being told that this is the only way to do this (although I really hope it's not).
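# First try to match on x and y, then on x and z (df1's z is called z1 after the first join)
left_join(df1, df2, by = c("x", "y"), suffix = c("1", "2")) %>%
  left_join(df2, by = c("x", "z1" = "z"), suffix = c("1", "2")) %>%
  rowwise() %>%
  # id21 comes from the y match, id22 from the z match; sum() skips the NA one
  # (this assumes at most one of the two joins found a match for each row)
  mutate(id2 = sum(id21, id22, na.rm = TRUE)) %>%
  select(id1, id2)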
# # A tibble: 3 x 2
# # Rowwise:
#     id1   id2
#   <int> <int>
# 1     1    12
# 2     2    11
# 3     3    13
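One possible way to read the "y OR z" condition is as the union of two ordinary joins: the pairs that agree on x and y, together with the pairs that agree on x and z. The following is only a sketch of that idea on the example data, not a tested general solution; the names by_y, by_z and or_links are made up for illustration.

```r
library(dplyr)

df1 <- tibble(id1 = 1:3, x = c("b", "a", "b"), y = c(55, 50, 58), z = c(65, 60, 69))
df2 <- tibble(id2 = 11:13, x = c("a", "b", "b"), y = c(50, 55, 59), z = c(61, 66, 69))

# Pairs of IDs whose rows agree on x AND y
by_y <- inner_join(df1, df2, by = c("x", "y")) %>% select(id1, id2)

# Pairs of IDs whose rows agree on x AND z
by_z <- inner_join(df1, df2, by = c("x", "z")) %>% select(id1, id2)

# The OR of the two conditions is the union of both sets of pairs;
# distinct() drops pairs that satisfy both conditions at once.
or_links <- bind_rows(by_y, by_z) %>%
  distinct() %>%
  arrange(id1)

or_links
```

Because each join is a plain equi-join, this avoids rowwise() and the sum() over possibly missing IDs, at the cost of joining twice.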
You could use tidyr to bring both data.frames into a long format.
library(tidyr)
library(dplyr)
df2_2 <- df2 %>%
  pivot_longer(c(y, z))
df1_1 <- df1 %>%
  pivot_longer(c(y, z))
Next you inner_join both of your pivoted data.frames by x and the new value column and bring them back into a wide format. So
df1_1 %>%
  inner_join(df2_2, by=c("x", "value")) %>%
  pivot_wider(names_from="name.x") %>%
  select(-name.y)
returns
# A tibble: 3 x 5
    id1 x       id2     y     z
  <int> <chr> <int> <dbl> <dbl>
1     1 b        12    55    NA
2     2 a        11    50    NA
3     3 b        13    NA    69
or
df1_1 %>%
  inner_join(df2_2, by=c("x", "value")) %>%
  pivot_wider(names_from="name.x") %>%
  select(id1, id2)
gives you your desired output
# A tibble: 3 x 2
    id1   id2
  <int> <int>
1     1    12
2     2    11
3     3    13
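One caveat with joining only by x and value is that a y value in one data.frame could, in principle, match a z value in the other. If that is not wanted, a possible variation (a sketch on the example data; the names df1_long, df2_long and links are chosen here for illustration) is to add the name column created by pivot_longer to the join keys, so that y is only compared with y and z only with z:

```r
library(dplyr)
library(tidyr)

df1 <- tibble(id1 = 1:3, x = c("b", "a", "b"), y = c(55, 50, 58), z = c(65, 60, 69))
df2 <- tibble(id2 = 11:13, x = c("a", "b", "b"), y = c(50, 55, 59), z = c(61, 66, 69))

# Long format: one row per (id, x, name, value), where name is "y" or "z"
df1_long <- df1 %>% pivot_longer(c(y, z))
df2_long <- df2 %>% pivot_longer(c(y, z))

# Joining on name as well restricts matches to the same original column
links <- df1_long %>%
  inner_join(df2_long, by = c("x", "name", "value")) %>%
  distinct(id1, id2)  # a pair matching on both y and z is kept only once

links
```

The distinct() call also makes the pivot_wider step unnecessary when only the ID pairs are needed.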
This works fine if y and z are of the same type. If they are of different types (for example one is a character, the other one a double), you have to convert them into a common format so that pivot_longer works.
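As a rough illustration of that conversion, the sketch below uses a hypothetical variant of the example data in which z is stored as character (df1_chr and df2_chr are invented for this purpose and are not part of the question). Both columns are converted to character before pivoting, which is one of several ways to bring them to a common type:

```r
library(dplyr)
library(tidyr)

# Hypothetical variant of the example data: y is double, z is character
df1_chr <- tibble(id1 = 1:3, x = c("b", "a", "b"),
                  y = c(55, 50, 58), z = c("65", "60", "69"))
df2_chr <- tibble(id2 = 11:13, x = c("a", "b", "b"),
                  y = c(50, 55, 59), z = c("61", "66", "69"))

# Convert y and z to a common type first, then pivot to long format
to_long <- function(df) {
  df %>%
    mutate(across(c(y, z), as.character)) %>%
    pivot_longer(c(y, z))
}

links_chr <- to_long(df1_chr) %>%
  inner_join(to_long(df2_chr), by = c("x", "value")) %>%
  distinct(id1, id2)

links_chr
```

Matching on character representations assumes equal numbers are written identically in both data.frames, so it is worth checking the formatting of the converted values on real data.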