merge/combine columns with same name but incomplete data

I have two data frames that have some columns with the same names and others with different names. The data frames look something like this:

df1
      ID hello world hockey soccer
    1  1    NA    NA      7      4
    2  2    NA    NA      2      5
    3  3    10     8      8     23
    4  4     4    17      5     12
    5  5    NA    NA      3     43

df2    
      ID hello world football baseball
    1  1     2     3       43        6
    2  2     5     1       24       32
    3  3    NA    NA        2       23
    4  4    NA    NA        5       15
    5  5     9     7       12       23

As you can see, in 2 of the shared columns ("hello" and "world"), some of the data is in one of the data frames and the rest is in the other.

What I am trying to do is (1) merge the 2 data frames by "ID", (2) combine all the data from the "hello" and "world" columns in both frames into 1 "hello" column and 1 "world" column, and (3) have the final data frame also contain all of the other columns in the 2 original frames ("hockey", "soccer", "football", "baseball"). So, I want the final result to be this:

  ID hello world hockey soccer football baseball
1  1     2     3      7      4        43       6
2  2     5     1      2      5        24      32
3  3    10     8      8     23         2      23
4  4     4    17      5     12         5      15
5  5     9     7      3     43        12      23

I'm pretty new to R, so the only code I've tried is variations on merge, plus the answer to a similar question: "R: merging copies of the same variable". However, my data sets are actually much bigger than what I'm showing here (there are about 20 matching columns (like "hello" and "world") and hundreds of non-matching ones (like "hockey" and "football")), so I'm looking for something that won't require me to write them all out manually.

Any idea if this can be done? I'm sorry I can't provide a sample of my efforts, but I really don't know where to start besides:

mydata <- merge(df1, df2, by=c("ID"), all = TRUE)

To reproduce the data frames:

df1 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(2, 5, NA, NA, 9), 
       world = c(3, 1, NA, NA, 7), football = c(43, 24, 2, 5, 12), 
       baseball = c(6, 32, 23, 15, 23)), .Names = c("ID", "hello", "world", 
       "football", "baseball"), class = "data.frame", row.names = c(NA, -5L)) 

df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(NA, NA, 10, 4, NA), 
       world = c(NA, NA, 8, 17, NA), hockey = c(7, 2, 8, 5, 3), 
       soccer = c(4, 5, 23, 12, 43)), .Names = c("ID", "hello", "world", "hockey", 
       "soccer"), class = "data.frame", row.names = c(NA, -5L))

Here's an approach that involves melting your data, merging the molten data, and using dcast to get it back to a wide form. I've added comments to help understand what is going on.

## Required package (current data.table versions provide their own melt() and dcast())
library(data.table)

dcast(
  merge(
    ## melt the first data.frame and set the key as ID and variable
    setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable), 
    ## melt the second data.frame
    melt(as.data.table(df2), id.vars = "ID"), 
    ## you'll have 2 value columns...
    all = TRUE)[, value := ifelse(
      ## ... combine them into 1 with ifelse
      is.na(value.x), value.y, value.x)], 
  ## This is your reshaping formula
  ID ~ variable, value.var = "value")
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43
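
The ifelse(is.na(value.x), value.y, value.x) step above is a coalesce: keep the first value that is not missing. As a minimal sketch, recent versions of data.table provide fcoalesce() for exactly this. The two vectors below are made up for illustration and merely stand in for the value.x and value.y columns.

```r
library(data.table)

# Hypothetical stand-ins for the value.x / value.y columns after the merge
value_x <- c(NA, 2, NA)
value_y <- c(1, 5, NA)

# Take value_x where it is present, otherwise fall back to value_y
via_ifelse    <- ifelse(is.na(value_x), value_y, value_x)
via_fcoalesce <- fcoalesce(value_x, value_y)

identical(via_ifelse, via_fcoalesce)
```

Inside the chained call, that would presumably read value := fcoalesce(value.x, value.y).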

Nobody's posted a dplyr solution, so here's a succinct option in dplyr. The approach is simply to do a full_join that combines all rows, then group and summarise to remove the redundant missing cells.

library(tidyverse)
df1 <- structure(list(ID = 1:5, hello = c(NA, NA, 10L, 4L, NA), world = c(NA, NA, 8L, 17L, NA), hockey = c(7L, 2L, 8L, 5L, 3L), soccer = c(4L, 5L, 23L, 12L, 43L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), hockey = structure(list(), class = c("collector_integer", "collector")), soccer = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))
df2 <- structure(list(ID = 1:5, hello = c(2L, 5L, NA, NA, 9L), world = c(3L, 1L, NA, NA, 7L), football = c(43L, 24L, 2L, 5L, 12L), baseball = c(6L, 32L, 23L, 15L, 2L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), football = structure(list(), class = c("collector_integer", "collector")), baseball = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

df1 %>%
  full_join(df2, by = intersect(colnames(df1), colnames(df2))) %>%
  group_by(ID) %>%
  summarize_all(na.omit)
#> # A tibble: 5 x 7
#>      ID hello world hockey soccer football baseball
#>   <int> <int> <int>  <int>  <int>    <int>    <int>
#> 1     1     2     3      7      4       43        6
#> 2     2     5     1      2      5       24       32
#> 3     3    10     8      8     23        2       23
#> 4     4     4    17      5     12        5       15
#> 5     5     9     7      3     43       12        2

Created on 2018-07-13 by the reprex package (v0.2.0).
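
To see why the group-and-summarise step is needed, here is a small sketch on toy data (the frames a and b are hypothetical, not the asker's). Because the join key includes the shared "hello" column, a row from one frame only matches a row from the other when "hello" agrees, so most IDs end up on two rows. The sketch also swaps na.omit for .x[!is.na(.x)][1], on the assumption that you want NA rather than a zero-length result when a value is missing in both frames.

```r
library(dplyr)

# Toy frames: ID 3 has "hello" missing on both sides
a <- data.frame(ID = 1:3, hello = c(NA, 10L, NA), hockey = c(7L, 2L, 8L))
b <- data.frame(ID = 1:3, hello = c(5L, NA, NA), soccer = c(4L, 5L, 6L))

# Join on every shared column (ID and hello); IDs 1 and 2 do not match
# on hello, so each of them appears on two half-empty rows
joined <- full_join(a, b, by = intersect(colnames(a), colnames(b)))

# Collapse to one row per ID, keeping the first non-missing value per column
out <- joined %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))
```

Here joined has five rows and out has three, with "hello" still NA for ID 3.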

Here's another data.table approach using binary merge:

library(data.table)
setkey(setDT(df1), ID) ; setkey(setDT(df2), ID) # Converting to data.table objects and setting keys
df1 <- df1[df2][, `:=`(i.hello = NULL, i.world = NULL)] # Full left join
df1[df2[complete.cases(df2)], `:=`(hello = i.hello, world = i.world)][] # Joining only on non-missing values
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43
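
The last line above is an update join: rows of the inner table are matched to df1 by key, and the matched cells are overwritten by reference. A minimal sketch of that one step, using two hypothetical tables x and y and an explicit on= instead of keys:

```r
library(data.table)

x <- data.table(ID = 1:3, hello = c(2, NA, 9))
y <- data.table(ID = 1:3, hello = c(NA, 10, NA))

# For rows of y where hello is present, overwrite x$hello by reference;
# i.hello refers to the hello column of the table in the i position
x[y[!is.na(hello)], on = "ID", hello := i.hello]

x$hello
# 2 10 9
```

Note that this overwrites whatever x already holds for the matched IDs, so it relies on the asker's premise that each value is present in only one of the two frames.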

@ananda-mahto's answer is more elegant but here is my suggestion:

library(reshape2)
df1=melt(df1,id='ID',na.rm=TRUE)
df2=melt(df2,id='ID',na.rm=TRUE)
DF=rbind(df1,df2)
# Not needed, added na.rm=TRUE based on @ananda-mahto's valid comment
# DF<-DF[!is.na(DF$value),]
dcast(DF,ID~variable,value.var='value')

Here is a more tidyr-centric approach that does something similar to the currently accepted answer. The approach is simply to stack the data frames on top of each other with bind_rows (which matches column names), gather up all the non-ID columns with na.rm = TRUE, and then spread them back out. Compared to a summarise option, this should be robust to situations where the condition "if the value is NA in "df1" it would have a value in "df2" (and vice versa)" doesn't always hold.

library(tidyverse)
df1 <- structure(list(ID = 1:5, hello = c(NA, NA, 10L, 4L, NA), world = c(NA, NA, 8L, 17L, NA), hockey = c(7L, 2L, 8L, 5L, 3L), soccer = c(4L, 5L, 23L, 12L, 43L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), hockey = structure(list(), class = c("collector_integer", "collector")), soccer = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))
df2 <- structure(list(ID = 1:5, hello = c(2L, 5L, NA, NA, 9L), world = c(3L, 1L, NA, NA, 7L), football = c(43L, 24L, 2L, 5L, 12L), baseball = c(6L, 32L, 23L, 15L, 2L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), football = structure(list(), class = c("collector_integer", "collector")), baseball = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

df1 %>%
  bind_rows(df2) %>%
  gather(variable, value, -ID, na.rm = TRUE) %>%
  spread(variable, value)
#> # A tibble: 5 x 7
#>      ID baseball football hello hockey soccer world
#>   <int>    <int>    <int> <int>  <int>  <int> <int>
#> 1     1        6       43     2      7      4     3
#> 2     2       32       24     5      2      5     1
#> 3     3       23        2    10      8     23     8
#> 4     4       15        5     4      5     12    17
#> 5     5        2       12     9      3     43     7

Created on 2018-07-13 by the reprex package (v0.2.0).
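
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same pipeline with the newer verbs, on hypothetical toy frames a and b:

```r
library(dplyr)
library(tidyr)

a <- data.frame(ID = 1:2, hello = c(NA, 10L), hockey = c(7L, 2L))
b <- data.frame(ID = 1:2, hello = c(5L, NA), soccer = c(4L, 5L))

out <- bind_rows(a, b) %>%
  # Long form, dropping the missing cells (the counterpart of na.rm = TRUE)
  pivot_longer(-ID, names_to = "variable", values_to = "value",
               values_drop_na = TRUE) %>%
  # Back to one column per variable
  pivot_wider(names_from = variable, values_from = value)
```

One difference to be aware of: spread() orders the new columns alphabetically, whereas pivot_wider() keeps them in order of first appearance. If both frames hold a value for the same ID and variable, pivot_wider() should warn and return list-columns rather than silently picking one.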

Using tidyverse we could use coalesce.

Neither of the solutions below builds extra rows; the data stays more or less the same size and shape throughout the chain.

Solution 1

list(df1,df2) %>%
  transpose(union(names(df1),names(df2))) %>%
  map_dfc(. %>% compact %>% invoke(coalesce,.))

# # A tibble: 5 x 7
#      ID hello world football baseball hockey soccer
#   <int> <dbl> <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
# 1     1     2     3       43        6      7      4
# 2     2     5     1       24       32      2      5
# 3     3    10     8        2       23      8     23
# 4     4     4    17        5       15      5     12
# 5     5     9     7       12       23      3     43

Explanations

  • Wrap both data frames into a list.
  • transpose it, so each new item at the root has the name of a column of the output. The default behavior of transpose is to take the first argument as a template, so unfortunately we have to be explicit to get all of them.
  • compact these items, as they were all of length 2, but with one of them being NULL when the given column was missing on one side.
  • coalesce those, which basically means returning the first non-NA value you find when putting the arguments side by side.
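
The last two steps can be tried in isolation. A minimal sketch with made-up vectors (not the asker's data):

```r
library(dplyr)
library(purrr)

# compact() drops the NULL you get when a column is absent from one frame
pieces <- compact(list(NULL, c(7L, 2L, 8L)))
length(pieces)
# 1

# coalesce() keeps, position by position, the first value that is not NA
coalesce(c(NA, NA, 10L), c(2L, 5L, NA))
# 2 5 10
```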

If repeating df1 and df2 on the second line is an issue, use the following instead:

transpose(invoke(union, setNames(map(., names), c("x","y"))))

Solution 2

Same philosophy, but this time we loop on names:

map_dfc(set_names(union(names(df1), names(df2))),
        ~ invoke(coalesce, compact(list(df1[[.x]], df2[[.x]]))))

# # A tibble: 5 x 7
#      ID hello world football baseball hockey soccer
#   <int> <dbl> <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
# 1     1     2     3       43        6      7      4
# 2     2     5     1       24       32      2      5
# 3     3    10     8        2       23      8     23
# 4     4     4    17        5       15      5     12
# 5     5     9     7       12       23      3     43

Here it is again in piped form, for those who may prefer it:

union(names(df1), names(df2)) %>%
  set_names %>%
  map_dfc(~ list(df1[[.x]], df2[[.x]]) %>%
            compact %>%
            invoke(coalesce, .))

Explanations

  • set_names gives a character vector names identical to its values, so map_dfc can name the output's columns correctly.
  • df1[[.x]] will return NULL when .x is not a column of df1; we take advantage of this.
  • df1 and df2 are mentioned 2 times each and I can't think of any way around it.

Solution 1 is cleaner with respect to these points, so I recommend it.
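
As far as I know, purrr 1.0.0 deprecated invoke() and superseded transpose() and map_dfc(), so the code above may emit lifecycle warnings on current installations. A sketch of Solution 2 that leans on base R for the looping and keeps only dplyr::coalesce follows; the toy frames a and b are hypothetical. Like the original, it assumes both frames have the same rows in the same order.

```r
library(dplyr)

a <- data.frame(ID = 1:2, hello = c(NA, 10L), hockey = c(7L, 2L))
b <- data.frame(ID = 1:2, hello = c(5L, NA), soccer = c(4L, 5L))

cols <- union(names(a), names(b))

merged <- lapply(setNames(cols, cols), function(nm) {
  # a[[nm]] or b[[nm]] is NULL when that frame lacks the column; drop it
  pieces <- Filter(Negate(is.null), list(a[[nm]], b[[nm]]))
  # First non-NA value, position by position
  do.call(coalesce, pieces)
})
merged <- as.data.frame(merged)
```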

We could use my package safejoin, do a left join, and deal with the conflicts using dplyr::coalesce:

# # devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)

safe_left_join(df1, df2, by = "ID", conflict = coalesce)
# # A tibble: 5 x 7
#      ID hello world hockey soccer football baseball
#   <int> <int> <int>  <int>  <int>    <int>    <int>
# 1     1     2     3      7      4       43        6
# 2     2     5     1      2      5       24       32
# 3     3    10     8      8     23        2       23
# 4     4     4    17      5     12        5       15
# 5     5     9     7      3     43       12        2
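
If installing a package from GitHub is not an option, the same idea (join by "ID", then resolve each conflicting column by taking the first non-missing value) can be sketched in base R, starting from the merge call in the question. The frames a and b are hypothetical toy data, and the loop covers every shared column, so nothing has to be written out by hand.

```r
a <- data.frame(ID = 1:2, hello = c(NA, 10L), hockey = c(7L, 2L))
b <- data.frame(ID = 1:2, hello = c(5L, NA), soccer = c(4L, 5L))

# Columns present in both frames, other than the key
shared <- setdiff(intersect(names(a), names(b)), "ID")

# Conflicting columns come back as hello.x / hello.y
merged <- merge(a, b, by = "ID", all = TRUE, suffixes = c(".x", ".y"))

for (nm in shared) {
  x <- merged[[paste0(nm, ".x")]]
  y <- merged[[paste0(nm, ".y")]]
  merged[[nm]] <- ifelse(is.na(x), y, x)  # prefer x, fall back to y
  merged[[paste0(nm, ".x")]] <- NULL
  merged[[paste0(nm, ".y")]] <- NULL
}
```

The combined columns end up at the right-hand side of merged; reorder them afterwards if the column order matters.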
