How to join(merge) multiple data frames and keep column names in R

Question

I have four data frames:

df01 <- data.frame(ID = c("001","002","003","004"),
                      Name = c("Ben","Jennifer","Mark","Brad"),
                      LastName = c("Affleck","Lopez","Anthony","Pitt"))
df02 <- data.frame(ID = c("001","002"),
                      Age = c(37,41))
df03 <- data.frame(ID = c("003"),
                      Age = c(28))
df04 <- data.frame(ID = c("004"),
                      Age = c(48))

I am trying to join using dplyr package with the function left_join like this:

df <- df01 %>% 
  left_join(df02, by = "ID") %>% 
  left_join(df03, by = "ID") %>% 
  left_join(df04, by = "ID")

And my current outcome is

> df
   ID     Name LastName Age.x Age.y Age
1 001      Ben  Affleck    37    NA  NA
2 002 Jennifer    Lopez    41    NA  NA
3 003     Mark  Anthony    NA    28  NA
4 004     Brad     Pitt    NA    NA  48

But my expected outcome would be:

> df
   ID     Name LastName Age
1 001      Ben  Affleck  37
2 002 Jennifer    Lopez  41
3 003     Mark  Anthony  28
4 004     Brad     Pitt  48

I would like to say, this is a very simplified issue because one solution would be binding rows and next applying left_join like this

dfx <- bind_rows(df02,df03,df04)

df <- df01 %>% 
  left_join(dfx, by = "ID")

but the real issue includes larger-than-memory and applying that solution would do an error called "Error: cannot allocate vector of size..."

Thank you very much for your help.

Answer 1

Here's a use of Reduce (or you can use purrr::reduce , effectively the same thing):

fun <- function(a, b) {
  out <- left_join(a, b, by = "ID", suffix = c("", ".y"))
  if (all(c("Age", "Age.y") %in% names(out))) {
    out <- mutate(out, Age = coalesce(Age.y, Age)) %>%
      select(-Age.y)
  }
  out
}
Reduce(fun, list(df01, df02, df03, df04))
#    ID     Name LastName Age
# 1 001      Ben  Affleck  37
# 2 002 Jennifer    Lopez  41
# 3 003     Mark  Anthony  28
# 4 004     Brad     Pitt  48

Quick walk-through:

Reduce calls the function ( fun here) on the first two elements of the list provided; it then calls with that return value and the 3rd element in the list; then calls with that return value and the 4th; until the list is exhausted
coalesce is a great function that returns the first non- NA value of the values provided; and it's vectorized, which is great; try coalesce(c(1,NA,3), c(22,33,44)) and get c(1,33,3) .

Answer 2

As you have only one value for each new Age column. You could sum all of them after the left joins.

df <- df01 %>% 
  left_join(df02, by = "ID") %>% 
  left_join(df03, by = "ID") %>% 
  left_join(df04, by = "ID") %>% 
  mutate(Age = sum(c_across(stringr::str_subset(colnames(.), "Age")), na.rm = TRUE)) %>% 
  select(-stringr::str_subset(colnames(.), "Age."))

With stringr package you can select all columns that have "Age" in their name. Then you could do the same to remove all columns with "Age."

Answer 3

Generally, join operations are used to combine tables with different column sets. Here df02, df03 and df04 have all the same columns and seem to require row binding, rather than joining. I would do like this:

> bind_rows(df02, df03, df04) %>% left_join(df01, ., by = "ID")
   ID     Name LastName Age
1 001      Ben  Affleck  37
2 002 Jennifer    Lopez  41
3 003     Mark  Anthony  28
4 004     Brad     Pitt  48

In case you are not sure that the IDs in those tables are unique, you need to decide what to do with duplicates. %>% group_by(ID) %>% summarize(Age = first(Age)) before the left join would select the first age among duplicate IDs, if any.

How to join(merge) multiple data frames and keep column names in R

Question

3 answers

solution1
0 2022-08-03 12:43:52

solution2
0 2022-08-03 13:28:42

solution3
0 2022-08-03 17:23:49

How to join(merge) multiple data frames and keep column names in R

Question

3 answers

solution1 0 2022-08-03 12:43:52

solution2 0 2022-08-03 13:28:42

solution3 0 2022-08-03 17:23:49

solution1
0 2022-08-03 12:43:52

solution2
0 2022-08-03 13:28:42

solution3
0 2022-08-03 17:23:49