Merge/combine columns with the same name but incomplete data

Question

I have two data frames in which some columns have the same names and others have different names. The data frames look like this:

df1
      ID hello world hockey soccer
    1  1    NA    NA      7      4
    2  2    NA    NA      2      5
    3  3    10     8      8     23
    4  4     4    17      5     12
    5  5    NA    NA      3     43

df2    
      ID hello world football baseball
    1  1     2     3       43        6
    2  2     5     1       24       32
    3  3    NA    NA        2       23
    4  4    NA    NA        5       15
    5  5     9     7       12       23

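As you can see, in the 2 shared columns ("hello" and "world"), some of the data is in one data frame and the rest is in the other.

What I want to do is (1) merge the 2 data frames by "ID", (2) combine all the data from the "hello" and "world" columns of both frames into 1 "hello" column and 1 "world" column, and (3) have the final data frame also contain all of the other columns from the 2 original frames ("hockey", "soccer", "football", "baseball"). So I would like the final result to look like this: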

  ID hello world hockey soccer football baseball
1  1     2     3      7      4        43       6
2  2     5     1      2      5        24      32
3  3    10     8      8     23         2      23
4  4     4    17      5     12         5      15
5  5     9     7      3     43        12      23

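I'm new to R, so the only code I've tried is variations on merge, and I've tried the answer I found here, which is based on a similar question: R: merging copies of the same variable. However, my data set is actually a lot bigger than what I'm showing here (there are about 20 matching columns (like "hello" and "world") and 100 non-matching ones (like "hockey" and "football")), so I'm looking for something that won't require me to write it all out manually.

Any idea if this can be done? I'm sorry I can't provide a sample of my efforts, but I really don't know where to start besides: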

mydata <- merge(df1, df2, by=c("ID"), all = TRUE)

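To reproduce the data frames (note that the names df1 and df2 are swapped here relative to the printed tables above):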

df1 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(2, 5, NA, NA, 9), 
       world = c(3, 1, NA, NA, 7), football = c(43, 24, 2, 5, 12), 
       baseball = c(6, 32, 23, 15, 23)), .Names = c("ID", "hello", "world", 
       "football", "baseball"), class = "data.frame", row.names = c(NA, -5L)) 

df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(NA, NA, 10, 4, NA), 
       world = c(NA, NA, 8, 17, NA), hockey = c(7, 2, 8, 5, 3), 
       soccer = c(4, 5, 23, 12, 43)), .Names = c("ID", "hello", "world", "hockey", 
       "soccer"), class = "data.frame", row.names = c(NA, -5L))

Answer 1

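Here is an approach that involves melting your data, merging the melted data, and using dcast to get it back to a wide form. I've added comments to help understand what is going on.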

## Required packages
library(data.table)
library(reshape2)

dcast.data.table(
  merge(
    ## melt the first data.frame and set the key as ID and variable
    setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable), 
    ## melt the second data.frame
    melt(as.data.table(df2), id.vars = "ID"), 
    ## you'll have 2 value columns...
    all = TRUE)[, value := ifelse(
      ## ... combine them into 1 with ifelse
      is.na(value.x), value.y, value.x)], 
  ## This is your reshaping formula
  ID ~ variable, value.var = "value")
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43
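
The key step above, merging and then collapsing the two value columns with ifelse, can also be tried without reshaping. The following is a minimal base R sketch, not the answerer's code: it merges the two frames by ID, which leaves pairs such as hello.x and hello.y (the default suffixes of merge), and then collapses each pair. The data is copied from the question; the variable name `shared` and the loop are illustrative.

```r
# Data from the question (df1 has football/baseball, df2 has hockey/soccer)
df1 <- data.frame(ID = 1:5, hello = c(2, 5, NA, NA, 9), world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12), baseball = c(6, 32, 23, 15, 23))
df2 <- data.frame(ID = 1:5, hello = c(NA, NA, 10, 4, NA), world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3), soccer = c(4, 5, 23, 12, 43))

# Columns present in both frames, excluding the join key
shared <- setdiff(intersect(names(df1), names(df2)), "ID")

# Merging by ID alone gives hello.x / hello.y and world.x / world.y
merged <- merge(df1, df2, by = "ID", all = TRUE)

for (col in shared) {
  x <- merged[[paste0(col, ".x")]]
  y <- merged[[paste0(col, ".y")]]
  # Take the value from df1 unless it is missing, then fall back to df2
  merged[[col]] <- ifelse(is.na(x), y, x)
  # Drop the suffixed helper columns
  merged[[paste0(col, ".x")]] <- NULL
  merged[[paste0(col, ".y")]] <- NULL
}

merged
```

Because `shared` is computed from the column names, the same loop should scale to the 20 or so matching columns mentioned in the question without listing them by hand.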

Answer 2

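Nobody has posted a dplyr solution, so here is a succinct option in dplyr. The approach is simply to do a full_join that combines all rows, then group and summarise to remove the redundant missing cells.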

library(tidyverse)
df1 <- structure(list(ID = 1:5, hello = c(NA, NA, 10L, 4L, NA), world = c(NA, NA, 8L, 17L, NA), hockey = c(7L, 2L, 8L, 5L, 3L), soccer = c(4L, 5L, 23L, 12L, 43L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), hockey = structure(list(), class = c("collector_integer", "collector")), soccer = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))
df2 <- structure(list(ID = 1:5, hello = c(2L, 5L, NA, NA, 9L), world = c(3L, 1L, NA, NA, 7L), football = c(43L, 24L, 2L, 5L, 12L), baseball = c(6L, 32L, 23L, 15L, 2L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), football = structure(list(), class = c("collector_integer", "collector")), baseball = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

df1 %>%
  full_join(df2, by = intersect(colnames(df1), colnames(df2))) %>%
  group_by(ID) %>%
  summarize_all(na.omit)
#> # A tibble: 5 x 7
#>      ID hello world hockey soccer football baseball
#>   <int> <int> <int>  <int>  <int>    <int>    <int>
#> 1     1     2     3      7      4       43        6
#> 2     2     5     1      2      5       24       32
#> 3     3    10     8      8     23        2       23
#> 4     4     4    17      5     12        5       15
#> 5     5     9     7      3     43       12        2

Created on 2018-07-13 by the reprex package (v0.2.0).
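
To see why the join-then-summarise idea works, it may help to look at the intermediate table. The sketch below is a base R analogue rather than the dplyr code itself: merge on all shared columns plays the role of full_join, and aggregate plays the role of group_by plus summarise. The helper `first_non_na` is a hypothetical name, and keeping only the first non-missing value per ID is an assumption about how to resolve a case where both frames hold a value.

```r
df1 <- data.frame(ID = 1:5, hello = c(2, 5, NA, NA, 9), world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12), baseball = c(6, 32, 23, 15, 23))
df2 <- data.frame(ID = 1:5, hello = c(NA, NA, 10, 4, NA), world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3), soccer = c(4, 5, 23, 12, 43))

# Joining on every shared column (ID, hello, world) matches no row pairs here,
# so each ID ends up with two half-empty rows: 10 rows in total
merged <- merge(df1, df2, by = intersect(names(df1), names(df2)), all = TRUE)

# Hypothetical helper: first non-missing value of a vector (NA if there is none)
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse the rows of each ID, column by column
result <- aggregate(merged[setdiff(names(merged), "ID")],
                    by = merged["ID"], FUN = first_non_na)
result
```

Printing `merged` before the aggregate step shows the redundant missing cells that the summarise call removes.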

Answer 3

Here is another data.table approach using binary joins

library(data.table)
setkey(setDT(df1), ID) ; setkey(setDT(df2), ID) # Converting to data.table objects and setting keys
df1 <- df1[df2][, `:=`(i.hello = NULL, i.world = NULL)] # Full left join
df1[df2[complete.cases(df2)], `:=`(hello = i.hello, world = i.world)][] # Joining only on non-missing values
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43
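
The same two-step idea (join the extra columns, then update the shared columns only from rows that have values) can be sketched in base R with match. This is an illustration under assumptions, not a translation of the data.table code: `shared`, `donor` and `rows` are names chosen here, and, like the update join above, it overwrites whatever df1 holds in the donor rows.

```r
df1 <- data.frame(ID = 1:5, hello = c(2, 5, NA, NA, 9), world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12), baseball = c(6, 32, 23, 15, 23))
df2 <- data.frame(ID = 1:5, hello = c(NA, NA, 10, 4, NA), world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3), soccer = c(4, 5, 23, 12, 43))

shared <- c("hello", "world")

# Step 1: left join only the columns that are unique to df2
out <- merge(df1, df2[setdiff(names(df2), shared)], by = "ID", all.x = TRUE)

# Step 2: rows of df2 whose shared columns are all present act as donors
donor <- df2[complete.cases(df2[shared]), ]
rows <- match(donor$ID, out$ID)

# Overwrite the shared columns in the matching rows
out[rows, shared] <- donor[shared]
out
```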

Answer 4

@ananda-mahto's answer is more elegant, but here is my suggestion:

library(reshape2)
df1=melt(df1,id='ID',na.rm=TRUE)
df2=melt(df2,id='ID',na.rm=TRUE)
DF=rbind(df1,df2)
# Not needed, added na.rm=TRUE based on @ananda-mahto's valid comment
# DF<-DF[!is.na(DF$value),]
dcast(DF,ID~variable,value.var='value')

Answer 5

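Here is a more tidyr-centric approach that does something similar to the currently accepted answer. The approach is simply to stack the data frames on top of each other with bind_rows (which matches column names), gather up all the non-ID columns with na.rm = TRUE, and then spread them back out. Compared with the summarise option, this should be robust to situations where the condition "if the value is NA in df1, it will have a value in df2 (and vice versa)" doesn't always hold.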

library(tidyverse)
df1 <- structure(list(ID = 1:5, hello = c(NA, NA, 10L, 4L, NA), world = c(NA, NA, 8L, 17L, NA), hockey = c(7L, 2L, 8L, 5L, 3L), soccer = c(4L, 5L, 23L, 12L, 43L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), hockey = structure(list(), class = c("collector_integer", "collector")), soccer = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))
df2 <- structure(list(ID = 1:5, hello = c(2L, 5L, NA, NA, 9L), world = c(3L, 1L, NA, NA, 7L), football = c(43L, 24L, 2L, 5L, 12L), baseball = c(6L, 32L, 23L, 15L, 2L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ID = structure(list(), class = c("collector_integer", "collector")), hello = structure(list(), class = c("collector_integer", "collector")), world = structure(list(), class = c("collector_integer", "collector")), football = structure(list(), class = c("collector_integer", "collector")), baseball = structure(list(), class = c("collector_integer", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))

df1 %>%
  bind_rows(df2) %>%
  gather(variable, value, -ID, na.rm = TRUE) %>%
  spread(variable, value)
#> # A tibble: 5 x 7
#>      ID baseball football hello hockey soccer world
#>   <int>    <int>    <int> <int>  <int>  <int> <int>
#> 1     1        6       43     2      7      4     3
#> 2     2       32       24     5      2      5     1
#> 3     3       23        2    10      8     23     8
#> 4     4       15        5     4      5     12    17
#> 5     5        2       12     9      3     43     7

Created on 2018-07-13 by the reprex package (v0.2.0).
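
The robustness point can be illustrated with a small base R sketch of the same stack, drop-NA, spread sequence. It is an analogue, not the tidyr code: `to_long` is a hypothetical helper standing in for gather, tapply stands in for spread, and the three-row data is made up so that ID 3 has no "hello" value in either frame.

```r
# Toy data: "hello" is missing in BOTH frames for ID 3
df1 <- data.frame(ID = 1:3, hello = c(2, NA, NA), football = c(43, 24, 2))
df2 <- data.frame(ID = 1:3, hello = c(NA, 5, NA), hockey = c(7, 2, 8))

# Hypothetical helper: one row per (ID, variable) pair, like gather()
to_long <- function(df, id = "ID") {
  vars <- setdiff(names(df), id)
  data.frame(ID = rep(df[[id]], times = length(vars)),
             variable = rep(vars, each = nrow(df)),
             value = unlist(df[vars], use.names = FALSE),
             stringsAsFactors = FALSE)
}

# Stack both frames and drop the missing cells (the na.rm = TRUE step)
long <- rbind(to_long(df1), to_long(df2))
long <- long[!is.na(long$value), ]

# Spread back to wide; combinations with no value are left as NA
wide <- tapply(long$value, list(ID = long$ID, variable = long$variable),
               function(v) v[1])
wide
```

The cell for ID 3 and "hello" simply stays NA in `wide`, which is the behaviour the answer describes for the case where neither frame has a value.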

Answer 6

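Using the tidyverse we can use coalesce.

Neither of the solutions below builds extra rows; the data stays roughly the same size and a similar shape throughout the chain.

Solution 1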

library(tidyverse)

list(df1,df2) %>%
  transpose(union(names(df1),names(df2))) %>%
  map_dfc(. %>% compact %>% invoke(coalesce,.))

# # A tibble: 5 x 7
#      ID hello world football baseball hockey soccer
#   <int> <dbl> <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
# 1     1     2     3       43        6      7      4
# 2     2     5     1       24       32      2      5
# 3     3    10     8        2       23      8     23
# 4     4     4    17        5       15      5     12
# 5     5     9     7       12       23      3     43

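Explanation

Wrap both data frames into a list.
transpose it, so each new item at the root has the name of a column of the output. The default behavior of transpose is to take the first argument as a template, so unfortunately we have to be explicit to get all of them.
compact these items, as they were all of length 2, but with one of them being NULL when the given column was missing on one side.
coalesce those, which basically means return the first non-NA you find when putting the arguments side by side.

If repeating df1 and df2 on the second line is an issue, use the following instead: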
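
For readers unfamiliar with coalesce, the last step can be mimicked in a few lines of base R. This is only a sketch of the semantics described above, assuming vectors of equal length; `coalesce_base` is a hypothetical name and is not part of dplyr.

```r
# Hypothetical base R stand-in for coalesce: walk through the arguments
# from left to right and keep the first non-NA value at each position
coalesce_base <- function(...) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

result <- coalesce_base(c(NA, 2, NA), c(1, 9, NA), c(7, 7, 7))
result
# position 1 comes from the second vector, position 2 from the first,
# position 3 from the third
```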

transpose(invoke(union, setNames(map(., names), c("x","y"))))

Solution 2

Same philosophy, but this time we loop over the names:

map_dfc(set_names(union(names(df1), names(df2))),
        ~ invoke(coalesce, compact(list(df1[[.x]], df2[[.x]]))))

# # A tibble: 5 x 7
#      ID hello world football baseball hockey soccer
#   <int> <dbl> <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
# 1     1     2     3       43        6      7      4
# 2     2     5     1       24       32      2      5
# 3     3    10     8        2       23      8     23
# 4     4     4    17        5       15      5     12
# 5     5     9     7       12       23      3     43

Here it is once more, piped, for those who may prefer that:

union(names(df1), names(df2)) %>%
  set_names %>%
  map_dfc(~ list(df1[[.x]], df2[[.x]]) %>%
            compact %>%
            invoke(coalesce, .))

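Explanation

set_names gives the character vector names identical to its values, so map_dfc can name the output's columns correctly.
df1[[.x]] will return NULL when .x is not a column of df1; we take advantage of this.
df1 and df2 are each mentioned 2 times, and I can't think of any way around it.

Solution 1 is cleaner in these respects, so I recommend it.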
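
The two tricks this answer relies on, `[[` returning NULL for a missing column and dropping those NULLs before coalescing, also work in base R. The sketch below is an analogue of Solution 2 under assumptions: Filter(Negate(is.null), ...) stands in for compact, the Reduce/ifelse call stands in for coalesce, and, as with the purrr versions, the rows of df1 and df2 are assumed to be in the same ID order.

```r
df1 <- data.frame(ID = 1:5, hello = c(2, 5, NA, NA, 9), world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12), baseball = c(6, 32, 23, 15, 23))
df2 <- data.frame(ID = 1:5, hello = c(NA, NA, 10, 4, NA), world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3), soccer = c(4, 5, 23, 12, 43))

# A column that does not exist comes back as NULL rather than an error
is.null(df1[["hockey"]])

all_names <- union(names(df1), names(df2))

cols <- lapply(all_names, function(nm) {
  # Keep whichever of the two columns exist (one or both)
  parts <- Filter(Negate(is.null), list(df1[[nm]], df2[[nm]]))
  # First non-NA value at each position, taken left to right
  Reduce(function(x, y) ifelse(is.na(x), y, x), parts)
})

result <- as.data.frame(setNames(cols, all_names))
result
```

If the two frames could be in different row orders, sorting both by ID first (or using one of the join-based answers) would be the safer choice.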

Answer 7

We can use my package safejoin, do a left join, and deal with the conflicts using dplyr::coalesce

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)

safe_left_join(df1, df2, by = "ID", conflict = coalesce)
# # A tibble: 5 x 7
#      ID hello world hockey soccer football baseball
#   <int> <int> <int>  <int>  <int>    <int>    <int>
# 1     1     2     3      7      4       43        6
# 2     2     5     1      2      5       24       32
# 3     3    10     8      8     23        2       23
# 4     4     4    17      5     12        5       15
# 5     5     9     7      3     43       12        2

7 answers

Answer 1
12 Accepted 2014-11-27 09:41:11

Answer 2
8 2018-07-13 20:46:53

Answer 3
6 2014-11-27 09:49:57

Answer 4
5 2014-11-27 09:50:06

Answer 5
5 2018-07-13 21:53:44

Answer 6
4 2018-07-17 16:43:22

Answer 7
0 2019-02-25 23:02:20
