Combining/merging information from data frames: a very specific output format needed

Question

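I want to perform a very specific kind of merge (working with taxonomic data from microbial sequencing). For example, suppose I have two data frames, each from a different sample (samples 1 and 2):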

df1 <- data.frame(lineage = c("1;131567", "1;131567;2;1224", "28216;32003;32011"),
                  taxonomy = c("bacteria", "archaea", "virus"),
                  count = c(15, 34, 12))
print(df1)
            lineage taxonomy count
1          1;131567 bacteria    15
2   1;131567;2;1224  archaea    34
3 28216;32003;32011    virus    12

df2 <- data.frame(lineage = c("204457;41297;165696", "1;131567;2;1224", "28216;32003;32011", "1;131567"),
                  taxonomy = c("fungi", "archaea", "virus", "bacteria"),
                  count = c(5, 34, 12, 11))
print(df2)
              lineage taxonomy count
1 204457;41297;165696    fungi     5
2     1;131567;2;1224  archaea    34
3   28216;32003;32011    virus    12
4            1;131567 bacteria    11

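In the output of merging the two data frames, I want a column named "lineage" that contains all the distinct lineages from both samples. I want to keep the taxonomy column (each taxonomy corresponds to exactly one lineage). The counts should come after that: df1 (sample 1) gets its own "count" column, with any missing values for lineages introduced by sample 2 shown as zero. The "count" from df2 should be a column placed after the count column from df1.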

For example, the final output should look like this:

output <- data.frame(lineage = c("1;131567", "1;131567;2;1224", "28216;32003;32011", "204457;41297;165696"),
                     taxonomy = c("bacteria", "archaea", "virus", "fungi"),
                     count_df1 = c(15, 34, 12, NA),
                     count_df2 = c(11, 34, 12, 5))
print(output)
              lineage taxonomy count_df1 count_df2
1            1;131567 bacteria        15        11
2     1;131567;2;1224  archaea        34        34
3   28216;32003;32011    virus        12        12
4 204457;41297;165696    fungi        NA         5

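The real data frame for each sample has >5000 rows, and I plan to use a merge option to combine 5 data frames sequentially into one. Any help would be much appreciated!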

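I have looked at merge() and the dplyr join() functions, but I could not work out how to keep the information the way I want. I am also open to trying other, non-merge options (especially since I plan to combine 6 data frames, not just the 2 in the example).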

Answer 1

The type of join you are looking for is full_join:

library(dplyr)

df1 |> 
  full_join(df2, by = c("lineage", "taxonomy"), suffix = c("_df1", "_df2"))  
#>               lineage taxonomy count_df1 count_df2
#> 1            1;131567 bacteria        15        11
#> 2     1;131567;2;1224  archaea        34        34
#> 3   28216;32003;32011    virus        12        12
#> 4 204457;41297;165696    fungi        NA         5
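
The question asks for missing counts to be shown as zero, whereas the join above leaves them as NA. One possible way to fill them in, as a minimal sketch: it assumes the count columns are numeric and share the `count_` prefix produced by the suffixes above, and the name `merged` is only illustrative.

```r
library(dplyr)

# Example data copied from the question
df1 <- data.frame(lineage = c("1;131567", "1;131567;2;1224", "28216;32003;32011"),
                  taxonomy = c("bacteria", "archaea", "virus"),
                  count = c(15, 34, 12))
df2 <- data.frame(lineage = c("204457;41297;165696", "1;131567;2;1224", "28216;32003;32011", "1;131567"),
                  taxonomy = c("fungi", "archaea", "virus", "bacteria"),
                  count = c(5, 34, 12, 11))

merged <- df1 |>
  full_join(df2, by = c("lineage", "taxonomy"), suffix = c("_df1", "_df2")) |>
  # Replace NA with 0 in every column whose name starts with "count_"
  mutate(across(starts_with("count_"), ~ coalesce(.x, 0)))

print(merged)
```

Whether zero or NA is the better representation depends on the downstream analysis: a zero states that the lineage was looked for and not found, while NA only states that no value is available.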

Or using merge:

merge(df1, df2, by = c("lineage", "taxonomy"), all = TRUE)
#>               lineage taxonomy count.x count.y
#> 1            1;131567 bacteria      15      11
#> 2     1;131567;2;1224  archaea      34      34
#> 3 204457;41297;165696    fungi      NA       5
#> 4   28216;32003;32011    virus      12      12
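
Since the question mentions combining more than two data frames, the same merge can be folded over a list with Reduce. The following is a sketch under assumptions, not a definitive implementation: the list `samples` and its element names are hypothetical, each count column is renamed up front so the default `.x`/`.y` suffixes never collide, and NA counts are set to zero at the end as the question requests.

```r
# Example data copied from the question
df1 <- data.frame(lineage = c("1;131567", "1;131567;2;1224", "28216;32003;32011"),
                  taxonomy = c("bacteria", "archaea", "virus"),
                  count = c(15, 34, 12))
df2 <- data.frame(lineage = c("204457;41297;165696", "1;131567;2;1224", "28216;32003;32011", "1;131567"),
                  taxonomy = c("fungi", "archaea", "virus", "bacteria"),
                  count = c(5, 34, 12, 11))

# Hypothetical named list; further samples can be appended here
samples <- list(df1 = df1, df2 = df2)

# Rename "count" to "count_<sample name>" in each data frame
renamed <- Map(function(df, name) {
  names(df)[names(df) == "count"] <- paste0("count_", name)
  df
}, samples, names(samples))

# Full outer merge across the whole list, two data frames at a time
merged <- Reduce(function(x, y) merge(x, y, by = c("lineage", "taxonomy"), all = TRUE),
                 renamed)

# Show lineages absent from a sample as 0 instead of NA
count_cols <- grep("^count_", names(merged), value = TRUE)
merged[count_cols][is.na(merged[count_cols])] <- 0

print(merged)
```

Note that merge sorts the result by the key columns, so the row order differs from that of full_join.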
