
merge/combine columns with same name but incomplete data

I have two data frames that have some columns with the same names and others with different names. The data frames look something like this:

df1
      ID hello world hockey soccer
    1  1    NA    NA      7      4
    2  2    NA    NA      2      5
    3  3    10     8      8     23
    4  4     4    17      5     12
    5  5    NA    NA      3     43

df2    
      ID hello world football baseball
    1  1     2     3       43        6
    2  2     5     1       24       32
    3  3    NA    NA        2       23
    4  4    NA    NA        5       15
    5  5     9     7       12       23

As you can see, in 2 of the shared columns ("hello" and "world"), some of the data is in one of the data frames and the rest is in the other.

What I am trying to do is (1) merge the 2 data frames by "ID", (2) combine all the data from the "hello" and "world" columns in both frames into 1 "hello" column and 1 "world" column, and (3) have the final data frame also contain all of the other columns in the 2 original frames ("hockey", "soccer", "football", "baseball"). So, I want the final result to be this:

  ID hello world hockey soccer football baseball
1  1     2     3      7      4        43       6
2  2     5     1      2      5        24      32
3  3    10     8      8     23         2      23
4  4     4    17      5     12         5      15
5  5     9     7      3     43        12      23

I'm pretty new to R, so the only code I've tried is variations on merge, plus the answer I found here, which was based on a similar question: "R: merging copies of the same variable". However, my data sets are actually much bigger than what I'm showing here (there are about 20 matching columns, like "hello" and "world", and hundreds of non-matching ones, like "hockey" and "football"), so I'm looking for something that won't require me to write them all out manually.

Any idea if this can be done? I'm sorry I can't provide a sample of my efforts, but I really don't know where to start besides:

mydata <- merge(df1, df2, by=c("ID"), all = TRUE)

To reproduce the data frames (note that df1 and df2 here are swapped relative to the tables printed above):

df1 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(2, 5, NA, NA, 9), 
       world = c(3, 1, NA, NA, 7), football = c(43, 24, 2, 5, 12), 
       baseball = c(6, 32, 23, 15, 23)), .Names = c("ID", "hello", "world", 
       "football", "baseball"), class = "data.frame", row.names = c(NA, -5L)) 

df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(NA, NA, 10, 4, NA), 
       world = c(NA, NA, 8, 17, NA), hockey = c(7, 2, 8, 5, 3), 
       soccer = c(4, 5, 23, 12, 43)), .Names = c("ID", "hello", "world", "hockey", 
       "soccer"), class = "data.frame", row.names = c(NA, -5L))

Here's an approach that involves melting your data, merging the molten data, and using dcast to get it back to a wide form. I've added comments to help explain what is going on.

## Required packages
library(data.table)
library(reshape2)

dcast.data.table(
  merge(
    ## melt the first data.frame and set the key as ID and variable
    setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable), 
    ## melt the second data.frame
    melt(as.data.table(df2), id.vars = "ID"), 
    ## you'll have 2 value columns...
    all = TRUE)[, value := ifelse(
      ## ... combine them into 1 with ifelse
      is.na(value.x), value.y, value.x)], 
  ## This is your reshaping formula
  ID ~ variable, value.var = "value")
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43
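
The nested call above can be hard to read, so here is a minimal step-by-step sketch of the same melt, merge, and cast sequence. It assumes data.table >= 1.12.4 (for fcoalesce(), which stands in for the ifelse() call) and uses data.table's own melt and dcast rather than reshape2. The intermediate names long1, long2, both and result are illustrative only.

```r
library(data.table)

# Same values as the structure() calls in the question
df1 <- data.frame(ID = 1:5,
                  hello = c(2, 5, NA, NA, 9),
                  world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12),
                  baseball = c(6, 32, 23, 15, 23))
df2 <- data.frame(ID = 1:5,
                  hello = c(NA, NA, 10, 4, NA),
                  world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3),
                  soccer = c(4, 5, 23, 12, 43))

# Step 1: melt each table to long form (one row per ID/variable pair)
long1 <- melt(as.data.table(df1), id.vars = "ID")
long2 <- melt(as.data.table(df2), id.vars = "ID")

# Step 2: full merge on ID and variable; the two value columns
# come back as value.x (from long1) and value.y (from long2)
both <- merge(long1, long2, by = c("ID", "variable"), all = TRUE)

# Step 3: keep value.x where it is present, otherwise use value.y
both[, value := fcoalesce(value.x, value.y)]

# Step 4: cast back to wide form, one column per variable
result <- dcast(both, ID ~ variable, value.var = "value")
print(result)
```

Splitting the steps this way makes it easy to inspect `both` and check which cells were filled from which table before casting.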

Nobody's posted a dplyr solution, so here's a succinct option in dplyr. The approach is simply to do a full_join on all shared columns, which keeps every row from both data frames, then group by ID and summarise to remove the redundant missing cells.

library(tidyverse)
df1 <- structure(list(ID = 1:5, hello = c(NA, NA, 10L, 4L, NA), world = c(NA, NA, 8L, 17L, NA), hockey = c(7L, 2L, 8L, 5L, 3L), soccer = c(4L, 5L, 23L, 12L, 43L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), hockey = structure(list(), class = c("collector_integer", "collector")), soccer = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))
df2 <- structure(list(ID = 1:5, hello = c(2L, 5L, NA, NA, 9L), world = c(3L, 1L, NA, NA, 7L), football = c(43L, 24L, 2L, 5L, 12L), baseball = c(6L, 32L, 23L, 15L, 2L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), football = structure(list(), class = c("collector_integer", "collector")), baseball = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

df1 %>%
  full_join(df2, by = intersect(colnames(df1), colnames(df2))) %>%
  group_by(ID) %>%
  summarize_all(na.omit)
#> # A tibble: 5 x 7
#>      ID hello world hockey soccer football baseball
#>   <int> <int> <int>  <int>  <int>    <int>    <int>
#> 1     1     2     3      7      4       43        6
#> 2     2     5     1      2      5       24       32
#> 3     3    10     8      8     23        2       23
#> 4     4     4    17      5     12        5       15
#> 5     5     9     7      3     43       12        2

Created on 2018-07-13 by the reprex package (v0.2.0).
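
To see why the group-and-summarise step is needed, it can help to look at the intermediate result of the join. The following is a minimal sketch, assuming dplyr >= 1.0 (for across()); the names shared, joined and first_non_na are illustrative and not part of the answer above. The helper returns NA for a column with no values at all, which is a deliberate difference from na.omit.

```r
library(dplyr)

# Same values as the data frames above, without the readr attributes
df1 <- tibble(ID = 1:5,
              hello = c(NA, NA, 10L, 4L, NA),
              world = c(NA, NA, 8L, 17L, NA),
              hockey = c(7L, 2L, 8L, 5L, 3L),
              soccer = c(4L, 5L, 23L, 12L, 43L))
df2 <- tibble(ID = 1:5,
              hello = c(2L, 5L, NA, NA, 9L),
              world = c(3L, 1L, NA, NA, 7L),
              football = c(43L, 24L, 2L, 5L, 12L),
              baseball = c(6L, 32L, 23L, 15L, 2L))

shared <- intersect(colnames(df1), colnames(df2))  # "ID" "hello" "world"

# No row of df1 matches a row of df2 on ID, hello AND world,
# so the full join keeps every row of both inputs separately
joined <- full_join(df1, df2, by = shared)
nrow(joined)  # 10: two rows per ID

# Hypothetical helper: first non-missing value, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse the two rows per ID into one
result <- joined %>%
  group_by(ID) %>%
  summarise(across(everything(), first_non_na))
print(result)
```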

Here's another data.table approach, using a keyed (binary search) join:

library(data.table)
setkey(setDT(df1), ID) ; setkey(setDT(df2), ID) # Converting to data.table objects and setting keys
df1 <- df1[df2][, `:=`(i.hello = NULL, i.world = NULL)] # Join df2 onto df1 by ID, then drop df2's duplicate hello/world columns
df1[df2[complete.cases(df2)], `:=`(hello = i.hello, world = i.world)][] # Joining only on non-missing values
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43

@ananda-mahto's answer is more elegant, but here is my suggestion:

library(reshape2)
df1=melt(df1,id='ID',na.rm=TRUE)
df2=melt(df2,id='ID',na.rm=TRUE)
DF=rbind(df1,df2)
# Not needed, added na.rm=TRUE based on @ananda-mahto's valid comment
# DF<-DF[!is.na(DF$value),]
dcast(DF,ID~variable,value.var='value')

Here is a more tidyr-centric approach that does something similar to the currently accepted answer. The approach is simply to stack the data frames on top of each other with bind_rows (which matches column names), gather up all the non-ID columns with na.rm = TRUE, and then spread them back out. Compared to a summarise option, this should be robust to situations where the condition "if the value is NA in df1 it has a value in df2 (and vice versa)" doesn't always hold.

library(tidyverse)
df1 <- structure(list(ID = 1:5, hello = c(NA, NA, 10L, 4L, NA), world = c(NA, NA, 8L, 17L, NA), hockey = c(7L, 2L, 8L, 5L, 3L), soccer = c(4L, 5L, 23L, 12L, 43L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), hockey = structure(list(), class = c("collector_integer", "collector")), soccer = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))
df2 <- structure(list(ID = 1:5, hello = c(2L, 5L, NA, NA, 9L), world = c(3L, 1L, NA, NA, 7L), football = c(43L, 24L, 2L, 5L, 12L), baseball = c(6L, 32L, 23L, 15L, 2L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), football = structure(list(), class = c("collector_integer", "collector")), baseball = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

df1 %>%
  bind_rows(df2) %>%
  gather(variable, value, -ID, na.rm = TRUE) %>%
  spread(variable, value)
#> # A tibble: 5 x 7
#>      ID baseball football hello hockey soccer world
#>   <int>    <int>    <int> <int>  <int>  <int> <int>
#> 1     1        6       43     2      7      4     3
#> 2     2       32       24     5      2      5     1
#> 3     3       23        2    10      8     23     8
#> 4     4       15        5     4      5     12    17
#> 5     5        2       12     9      3     43     7

Created on 2018-07-13 by the reprex package (v0.2.0).
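
For readers on a newer tidyr, the same stack, lengthen, and widen sequence can be sketched with pivot_longer and pivot_wider, which were introduced in tidyr 1.0.0 as the successors of gather and spread. This is an assumed equivalent rather than the code from the answer above, and the column order of the result may differ from the output shown there.

```r
library(dplyr)
library(tidyr)

# Same values as the data frames above, without the readr attributes
df1 <- tibble(ID = 1:5,
              hello = c(NA, NA, 10L, 4L, NA),
              world = c(NA, NA, 8L, 17L, NA),
              hockey = c(7L, 2L, 8L, 5L, 3L),
              soccer = c(4L, 5L, 23L, 12L, 43L))
df2 <- tibble(ID = 1:5,
              hello = c(2L, 5L, NA, NA, 9L),
              world = c(3L, 1L, NA, NA, 7L),
              football = c(43L, 24L, 2L, 5L, 12L),
              baseball = c(6L, 32L, 23L, 15L, 2L))

result <- bind_rows(df1, df2) %>%
  # Long form: one row per ID/variable pair, dropping missing cells
  pivot_longer(-ID, names_to = "variable", values_to = "value",
               values_drop_na = TRUE) %>%
  # Back to wide form: one column per variable
  pivot_wider(names_from = variable, values_from = value)
print(result)
```

If both data frames held a value for the same ID and column, pivot_wider would warn and return list-columns, so that case would need to be resolved first (for example by keeping the first value).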

Using the tidyverse, we could use coalesce.

None of the solutions below builds extra rows; the data keeps roughly the same size and shape throughout the chain.

Solution 1

list(df1,df2) %>%
  transpose(union(names(df1),names(df2))) %>%
  map_dfc(. %>% compact %>% invoke(coalesce,.))

# # A tibble: 5 x 7
#      ID hello world football baseball hockey soccer
#   <int> <dbl> <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
# 1     1     2     3       43        6      7      4
# 2     2     5     1       24       32      2      5
# 3     3    10     8        2       23      8     23
# 4     4     4    17        5       15      5     12
# 5     5     9     7       12       23      3     43

Explanations

  • Wrap both data frames into a list.
  • transpose it, so that each top-level item is named after one column of the output. By default transpose uses the first list element as a template for the names, so we unfortunately have to pass all the names explicitly to get every column.
  • compact these items: each has length 2, but one of the two elements is NULL when the column is missing from one of the data frames.
  • coalesce what remains, which means taking, position by position, the first non-NA value among the arguments.

If repeating df1 and df2 on the second line is an issue, use the following instead:

transpose(invoke(union, setNames(map(., names), c("x","y"))))
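
The coalesce step described in the explanations is the core of this approach, so here is a minimal standalone illustration on two plain vectors. The vectors x and y are made up for the example and simply mimic the "hello" column of each data frame.

```r
library(dplyr)

x <- c(NA, NA, 10, 4, NA)   # "hello" as it appears in one data frame
y <- c(2, 5, NA, NA, 9)     # "hello" as it appears in the other

# Position by position, take the first value that is not NA
coalesce(x, y)
# [1]  2  5 10  4  9

# The first argument wins when both have a value;
# the result is NA only where every argument is NA
coalesce(c(1, NA), c(2, NA))
# [1]  1 NA
```

Because coalesce works position by position, this approach relies on both data frames listing the IDs in the same row order.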

Solution 2

Same philosophy, but this time we loop on names:

map_dfc(set_names(union(names(df1), names(df2))),
        ~ invoke(coalesce, compact(list(df1[[.x]], df2[[.x]]))))

# # A tibble: 5 x 7
#      ID hello world football baseball hockey soccer
#   <int> <dbl> <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
# 1     1     2     3       43        6      7      4
# 2     2     5     1       24       32      2      5
# 3     3    10     8        2       23      8     23
# 4     4     4    17        5       15      5     12
# 5     5     9     7       12       23      3     43

Here is the same thing in piped form, for those who prefer it:

union(names(df1), names(df2)) %>%
  set_names %>%
  map_dfc(~ list(df1[[.x]], df2[[.x]]) %>%
            compact %>%
            invoke(coalesce, .))

Explanations

  • set_names gives the character vector names identical to its values, so map_dfc can name the output's columns correctly.
  • df1[[.x]] returns NULL when .x is not a column of df1; we take advantage of this.
  • df1 and df2 are each mentioned twice, and I can't think of a way around it.

Solution 1 is cleaner in these respects, so I recommend it.
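
For comparison, the loop-over-names idea of Solution 2 can also be sketched in base R, without purrr or dplyr. This is an illustrative equivalent, not the answer's code: coalesce2 is a hypothetical helper, and like the solutions above it combines columns position by position, so it assumes both data frames list the same IDs in the same row order.

```r
# Same values as the structure() calls in the question
df1 <- data.frame(ID = 1:5,
                  hello = c(2, 5, NA, NA, 9),
                  world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12),
                  baseball = c(6, 32, 23, 15, 23))
df2 <- data.frame(ID = 1:5,
                  hello = c(NA, NA, 10, 4, NA),
                  world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3),
                  soccer = c(4, 5, 23, 12, 43))

# Hypothetical helper: two-argument coalesce that tolerates a NULL
# input, which is what df[[name]] returns for a column that is absent
coalesce2 <- function(x, y) {
  if (is.null(x)) return(y)
  if (is.null(y)) return(x)
  ifelse(is.na(x), y, x)
}

all_cols <- union(names(df1), names(df2))

# Build each output column from whichever frame(s) contain it
result <- as.data.frame(
  lapply(setNames(all_cols, all_cols),
         function(nm) coalesce2(df1[[nm]], df2[[nm]]))
)
print(result)
```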

We could use my package safejoin to do a left join and resolve the conflicting columns with dplyr::coalesce:

# # devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)

safe_left_join(df1, df2, by = "ID", conflict = coalesce)
# # A tibble: 5 x 7
#      ID hello world hockey soccer football baseball
#   <int> <int> <int>  <int>  <int>    <int>    <int>
# 1     1     2     3      7      4       43        6
# 2     2     5     1      2      5       24       32
# 3     3    10     8      8     23        2       23
# 4     4     4    17      5     12        5       15
# 5     5     9     7      3     43       12        2
