I have four data frames:
df01 <- data.frame(ID = c("001","002","003","004"),
Name = c("Ben","Jennifer","Mark","Brad"),
LastName = c("Affleck","Lopez","Anthony","Pitt"))
df02 <- data.frame(ID = c("001","002"),
Age = c(37,41))
df03 <- data.frame(ID = c("003"),
Age = c(28))
df04 <- data.frame(ID = c("004"),
Age = c(48))
I am trying to join using dplyr package with the function left_join like this:
df <- df01 %>%
left_join(df02, by = "ID") %>%
left_join(df03, by = "ID") %>%
left_join(df04, by = "ID")
And my current outcome is
> df
ID Name LastName Age.x Age.y Age
1 001 Ben Affleck 37 NA NA
2 002 Jennifer Lopez 41 NA NA
3 003 Mark Anthony NA 28 NA
4 004 Brad Pitt NA NA 48
But my expected outcome would be:
> df
ID Name LastName Age
1 001 Ben Affleck 37
2 002 Jennifer Lopez 41
3 003 Mark Anthony 28
4 004 Brad Pitt 48
I would like to say, this is a very simplified issue because one solution would be binding rows and next applying left_join like this
dfx <- bind_rows(df02,df03,df04)
df <- df01 %>%
left_join(dfx, by = "ID")
but the real issue includes larger-than-memory and applying that solution would do an error called "Error: cannot allocate vector of size..."
Thank you very much for your help.
Here's a use of Reduce
(or you can use purrr::reduce
, effectively the same thing):
fun <- function(a, b) {
out <- left_join(a, b, by = "ID", suffix = c("", ".y"))
if (all(c("Age", "Age.y") %in% names(out))) {
out <- mutate(out, Age = coalesce(Age.y, Age)) %>%
select(-Age.y)
}
out
}
Reduce(fun, list(df01, df02, df03, df04))
# ID Name LastName Age
# 1 001 Ben Affleck 37
# 2 002 Jennifer Lopez 41
# 3 003 Mark Anthony 28
# 4 004 Brad Pitt 48
Quick walk-through:
Reduce
calls the function ( fun
here) on the first two elements of the list provided; it then calls with that return value and the 3rd element in the list; then calls with that return value and the 4th; until the list is exhaustedcoalesce
is a great function that returns the first non- NA
value of the values provided; and it's vectorized, which is great; try coalesce(c(1,NA,3), c(22,33,44))
and get c(1,33,3)
.As you have only one value for each new Age
column. You could sum all of them after the left joins.
df <- df01 %>%
left_join(df02, by = "ID") %>%
left_join(df03, by = "ID") %>%
left_join(df04, by = "ID") %>%
mutate(Age = sum(c_across(stringr::str_subset(colnames(.), "Age")), na.rm = TRUE)) %>%
select(-stringr::str_subset(colnames(.), "Age."))
With stringr
package you can select all columns that have "Age" in their name. Then you could do the same to remove all columns with "Age."
Generally, join operations are used to combine tables with different column sets. Here df02, df03 and df04 have all the same columns and seem to require row binding, rather than joining. I would do like this:
> bind_rows(df02, df03, df04) %>% left_join(df01, ., by = "ID")
ID Name LastName Age
1 001 Ben Affleck 37
2 002 Jennifer Lopez 41
3 003 Mark Anthony 28
4 004 Brad Pitt 48
In case you are not sure that the IDs in those tables are unique, you need to decide what to do with duplicates. %>% group_by(ID) %>% summarize(Age = first(Age))
before the left join would select the first age among duplicate IDs, if any.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.