简体   繁体   中英

How to join(merge) multiple data frames and keep column names in R

I have four data frames:

df01 <- data.frame(ID = c("001","002","003","004"),
                      Name = c("Ben","Jennifer","Mark","Brad"),
                      LastName = c("Affleck","Lopez","Anthony","Pitt"))
df02 <- data.frame(ID = c("001","002"),
                      Age = c(37,41))
df03 <- data.frame(ID = c("003"),
                      Age = c(28))
df04 <- data.frame(ID = c("004"),
                      Age = c(48))

I am trying to join using dplyr package with the function left_join like this:

df <- df01 %>% 
  left_join(df02, by = "ID") %>% 
  left_join(df03, by = "ID") %>% 
  left_join(df04, by = "ID")

And my current outcome is

> df
   ID     Name LastName Age.x Age.y Age
1 001      Ben  Affleck    37    NA  NA
2 002 Jennifer    Lopez    41    NA  NA
3 003     Mark  Anthony    NA    28  NA
4 004     Brad     Pitt    NA    NA  48

But my expected outcome would be:

> df
   ID     Name LastName Age
1 001      Ben  Affleck  37
2 002 Jennifer    Lopez  41
3 003     Mark  Anthony  28
4 004     Brad     Pitt  48

I would like to say, this is a very simplified issue because one solution would be binding rows and next applying left_join like this

dfx <- bind_rows(df02,df03,df04)

df <- df01 %>% 
  left_join(dfx, by = "ID")

but the real issue includes larger-than-memory and applying that solution would do an error called "Error: cannot allocate vector of size..."

Thank you very much for your help.

Here's a use of Reduce (or you can use purrr::reduce , effectively the same thing):

fun <- function(a, b) {
  out <- left_join(a, b, by = "ID", suffix = c("", ".y"))
  if (all(c("Age", "Age.y") %in% names(out))) {
    out <- mutate(out, Age = coalesce(Age.y, Age)) %>%
      select(-Age.y)
  }
  out
}
Reduce(fun, list(df01, df02, df03, df04))
#    ID     Name LastName Age
# 1 001      Ben  Affleck  37
# 2 002 Jennifer    Lopez  41
# 3 003     Mark  Anthony  28
# 4 004     Brad     Pitt  48

Quick walk-through:

  • Reduce calls the function ( fun here) on the first two elements of the list provided; it then calls with that return value and the 3rd element in the list; then calls with that return value and the 4th; until the list is exhausted
  • coalesce is a great function that returns the first non- NA value of the values provided; and it's vectorized, which is great; try coalesce(c(1,NA,3), c(22,33,44)) and get c(1,33,3) .

As you have only one value for each new Age column. You could sum all of them after the left joins.

df <- df01 %>% 
  left_join(df02, by = "ID") %>% 
  left_join(df03, by = "ID") %>% 
  left_join(df04, by = "ID") %>% 
  mutate(Age = sum(c_across(stringr::str_subset(colnames(.), "Age")), na.rm = TRUE)) %>% 
  select(-stringr::str_subset(colnames(.), "Age."))

With stringr package you can select all columns that have "Age" in their name. Then you could do the same to remove all columns with "Age."

Generally, join operations are used to combine tables with different column sets. Here df02, df03 and df04 have all the same columns and seem to require row binding, rather than joining. I would do like this:

> bind_rows(df02, df03, df04) %>% left_join(df01, ., by = "ID")
   ID     Name LastName Age
1 001      Ben  Affleck  37
2 002 Jennifer    Lopez  41
3 003     Mark  Anthony  28
4 004     Brad     Pitt  48

In case you are not sure that the IDs in those tables are unique, you need to decide what to do with duplicates. %>% group_by(ID) %>% summarize(Age = first(Age)) before the left join would select the first age among duplicate IDs, if any.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM