简体   繁体   中英

Performing a dplyr full_join without a common variable to blend data frames

Using the dplyr full_join() operation, I am trying to perform the equivalent of a basic merge() operation in which no common variables exist (unable to satisfy the "by=" argument). This will blend two data frames and return all possible combinations.

However, the current full_join() function requires a common variable. I am unable to locate another dplyr function that can help with this. How can I perform this operation using functions specific to the dplyr library?

df_a = data.frame(department=c(1,2,3,4))
df_b = data.frame(period=c(2014,2015,2016,2017))

#This works as desired
big_df = merge(df_a,df_b)

#I'd like to perform the following in a much bigger operation:
big_df = dplyr::full_join(df_a,df_b)

#Error: No common variables. Please specify `by` param.

You can use crossing from tidyr :

crossing(df_a,df_b)

   department period
1           1   2014
2           1   2015
3           1   2016
4           1   2017
5           2   2014
6           2   2015
7           2   2016
8           2   2017
9           3   2014
10          3   2015
11          3   2016
12          3   2017
13          4   2014
14          4   2015
15          4   2016
16          4   2017

If there are duplicate rows, crossing doesn't give the same result as merge .

Instead use full_join with by = character() to perform a cross-join which generates all combinations of df_a and df_b .

library("tidyverse") # version 1.3.2

# Add duplicate rows for illustration.
df_a <- tibble(department = c(1, 2, 3, 3))
df_b <- tibble(period = c(2014, 2015, 2016, 2017))

merge doesn't de-duplicate.

df_a_merge_b <- merge(df_a, df_b)
df_a_merge_b
#>    department period
#> 1           1   2014
#> 2           2   2014
#> 3           3   2014
#> 4           3   2014
#> 5           1   2015
#> 6           2   2015
#> 7           3   2015
#> 8           3   2015
#> 9           1   2016
#> 10          2   2016
#> 11          3   2016
#> 12          3   2016
#> 13          1   2017
#> 14          2   2017
#> 15          3   2017
#> 16          3   2017

crossing drops duplicate rows.

df_a_crossing_b <- crossing(df_a, df_b)
df_a_crossing_b
#> # A tibble: 12 × 2
#>    department period
#>         <dbl>  <dbl>
#>  1          1   2014
#>  2          1   2015
#>  3          1   2016
#>  4          1   2017
#>  5          2   2014
#>  6          2   2015
#>  7          2   2016
#>  8          2   2017
#>  9          3   2014
#> 10          3   2015
#> 11          3   2016
#> 12          3   2017

full_join doesn't remove duplicates either.

df_a_full_join_b <- full_join(df_a, df_b, by = character())
df_a_full_join_b
#> # A tibble: 16 × 2
#>    department period
#>         <dbl>  <dbl>
#>  1          1   2014
#>  2          1   2015
#>  3          1   2016
#>  4          1   2017
#>  5          2   2014
#>  6          2   2015
#>  7          2   2016
#>  8          2   2017
#>  9          3   2014
#> 10          3   2015
#> 11          3   2016
#> 12          3   2017
#> 13          3   2014
#> 14          3   2015
#> 15          3   2016
#> 16          3   2017

packageVersion("tidyverse")
#> [1] '1.3.2'

Created on 2023-01-13 with reprex v2.0.2

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM