Join dataframes using an OR condition for columns to match by

Suppose we have two dataframes we want to join. For the sake of simplicity, suppose both have the same number of rows, and that each row in one dataframe has exactly one corresponding row in the other dataframe. Here is an MWE of this setup:

library(dplyr)
df1 <- tibble(id1 = 1:3, x = c("b", "a", "b"), y = c(55, 50, 58), z = c(65, 60, 69))
df2 <- tibble(id2 = 11:13, x = c("a", "b", "b"), y = c(50, 55, 59), z = c(61, 66, 69))

# # A tibble: 3 x 4
#     id1 x         y     z
#   <int> <chr> <dbl> <dbl>
# 1     1 b        55    65
# 2     2 a        50    60
# 3     3 b        58    69
# # A tibble: 3 x 4
#     id2 x         y     z
#   <int> <chr> <dbl> <dbl>
# 1    11 a        50    61
# 2    12 b        55    66
# 3    13 b        59    69

The goal is to link the unique IDs in df1 with the unique but different IDs in df2. I can do so by exploiting the fact that the combined information in columns x, y, and z is sufficient to uniquely identify the rows. The problem, explained in detail below, is that the join would only work correctly in this setup if we were able to join by something like c("x", "y" OR "z").

  • Starting from column x, we have enough information to link id1 == 2 with id2 == 11, because they share the same unique value (i.e. "a"). However, column x alone is not enough for the remaining pairs.
  • Using column y in the join allows us to link id1 == 1 with id2 == 12. However, column y (in addition to x) will (correctly) not match id1 == 3 with id2 == 13, because they have different y values.
  • Using column z in the join allows us to link id1 == 3 with id2 == 13, but then the same problem appears for id1 == 1 and id2 == 12: they differ in their z values. Moreover, if we join by z then we destroy the link of id1 == 2 with id2 == 11 (although the example would still work without this extra quirk).
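The three cases above can be checked by running the two strict joins separately. This is only a minimal sketch on the example data; the names by_y and by_z are introduced here for illustration.

```r
library(dplyr)

df1 <- tibble(id1 = 1:3, x = c("b", "a", "b"), y = c(55, 50, 58), z = c(65, 60, 69))
df2 <- tibble(id2 = 11:13, x = c("a", "b", "b"), y = c(50, 55, 59), z = c(61, 66, 69))

# Strict match on x and y: links id1 1 and 2, but misses id1 == 3.
by_y <- inner_join(df1, df2, by = c("x", "y"), suffix = c("1", "2")) %>%
  select(id1, id2)

# Strict match on x and z: links only id1 == 3.
by_z <- inner_join(df1, df2, by = c("x", "z"), suffix = c("1", "2")) %>%
  select(id1, id2)

print(by_y)
print(by_z)
```

Neither join alone recovers all three pairs, which is why an OR over y and z is needed.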

Therefore, the question is whether there is a good and/or succinct way of joining these two dataframes using an OR condition on columns y and z, in addition to matching strictly on x. Something like:

full_join(df1, df2, by = c("x", "y" OR "z"))

My attempt so far involves a series of joins and other manipulations, but it is very cumbersome to read (which makes it prone to errors), and likely too slow for large enough data. I'm open to corrections to this method, or to being told that this is the only way to do this (although I really hope it's not).

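# First join on x and y; unmatched rows get NA for the df2 columns.
left_join(df1, df2, by = c("x", "y"), suffix = c("1", "2")) %>% 
  # Second join on x and z, using the z value that came from df1 (z1).
  left_join(df2, by = c("x", "z1" = "z"), suffix = c("1", "2")) %>% 
  rowwise() %>% 
  # Assumes only one of the two joins found a match for each row.
  mutate(id2 = sum(id21, id22, na.rm = TRUE)) %>% 
  select(id1, id2)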

# # A tibble: 3 x 2
# # Rowwise: 
#     id1   id2
#   <int> <int>
# 1     1    12
# 2     2    11
# 3     3    13

You could use tidyr to bring both data.frames into a long format.

library(tidyr)
library(dplyr)

df2_2 <- df2 %>% 
  pivot_longer(c(y, z))

df1_1 <- df1 %>% 
  pivot_longer(c(y, z))

Next, inner_join the two pivoted data.frames by x and the new value column, then bring the result back into a wide format. So

df1_1  %>% 
  inner_join(df2_2, by=c("x", "value")) %>% 
  pivot_wider(names_from="name.x") %>% 
  select(-name.y)

returns

# A tibble: 3 x 5
    id1 x       id2     y     z
  <int> <chr> <int> <dbl> <dbl>
1     1 b        12    55    NA
2     2 a        11    50    NA
3     3 b        13    NA    69

or

df1_1  %>% 
  inner_join(df2_2, by=c("x", "value")) %>% 
  pivot_wider(names_from="name.x") %>% 
  select(id1, id2)

gives you your desired output

# A tibble: 3 x 2
    id1   id2
  <int> <int>
1     1    12
2     2    11
3     3    13
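One caveat with joining on value alone is that a y value in df1 can also match a z value in df2, because the column name is not part of the join key. If that is not wanted, a possible variant is to add the name column to the keys and use distinct() to collapse pairs that agree on both y and z. This is a sketch checked only against the example data; the names df1_long, df2_long, and links are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

df1 <- tibble(id1 = 1:3, x = c("b", "a", "b"), y = c(55, 50, 58), z = c(65, 60, 69))
df2 <- tibble(id2 = 11:13, x = c("a", "b", "b"), y = c(50, 55, 59), z = c(61, 66, 69))

# Same pivoting step as above: one row per (id, x, column name, value).
df1_long <- df1 %>% pivot_longer(c(y, z))
df2_long <- df2 %>% pivot_longer(c(y, z))

links <- df1_long %>%
  # Including "name" means y only matches y and z only matches z.
  inner_join(df2_long, by = c("x", "name", "value")) %>%
  # A pair that agrees on both y and z would appear twice; keep it once.
  distinct(id1, id2)

print(links)
```

The pivoting step itself is unchanged; only the join keys differ.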

This works fine if y and z are of the same type. If they are of different types (for example, one is a character and the other a double), you have to convert them to a common type first so that pivot_longer works.
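As a rough sketch of that conversion, the columns can be cast to character before pivoting. The data below is a hypothetical variant of the example in which z is stored as character; df1_mixed, df2_mixed, to_long, and links are names made up for this illustration. Note that comparing numbers as strings is only safe when both sides print identically, which holds for the whole numbers used here but may not for arbitrary doubles.

```r
library(dplyr)
library(tidyr)

# Hypothetical variant of the example: y is double, z is character.
df1_mixed <- tibble(id1 = 1:3, x = c("b", "a", "b"),
                    y = c(55, 50, 58), z = c("65", "60", "69"))
df2_mixed <- tibble(id2 = 11:13, x = c("a", "b", "b"),
                    y = c(50, 55, 59), z = c("61", "66", "69"))

# Cast y and z to a common type (character) so they can share one value column.
to_long <- function(df) {
  df %>%
    mutate(across(c(y, z), as.character)) %>%
    pivot_longer(c(y, z))
}

links <- to_long(df1_mixed) %>%
  inner_join(to_long(df2_mixed), by = c("x", "value")) %>%
  distinct(id1, id2)

print(links)
```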
