R merge and left_join outputs duplicated rows

I have two data frames with this structure:

> df_gen[1:5,]              
          Genus      mean_RA
1  Unclassified 0.1357401738
2 Lactobacillus 0.0003825068
3  Prevotella 9 0.0009573787
4  Anaerovibrio 0.0049035545
5     Roseburia 0.0026672558

> df_tax[1:8,]              
   Kingdom        Phylum         Class           Order           Family         Genus
1 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae  Prevotella 9
2 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae  Prevotella 9
3 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae  Prevotella 9 
4 Bacteria    Firmicutes       Bacilli Lactobacillales Lactobacillaceae Lactobacillus
5 Bacteria    Firmicutes Negativicutes Selenomonadales  Veillonellaceae  Anaerovibrio
6 Bacteria    Firmicutes Negativicutes Selenomonadales  Veillonellaceae  Anaerovibrio
7 Bacteria    Firmicutes       Bacilli Lactobacillales Lactobacillaceae Lactobacillus
8 Bacteria    Firmicutes    Clostridia   Clostridiales  Lachnospiraceae     Roseburia

I want to merge df_gen with df_tax, but when I do, every row gets duplicated, like this:

> merge(df_gen, df_tax, by = "Genus", all.x = TRUE)
          Genus      mean_RA  Kingdom        Phylum         Class           Order           Family
1  Unclassified 0.1357401738       NA            NA            NA              NA               NA
2 Lactobacillus 0.0003825068 Bacteria    Firmicutes       Bacilli Lactobacillales Lactobacillaceae
3 Lactobacillus 0.0003825068 Bacteria    Firmicutes       Bacilli Lactobacillales Lactobacillaceae
4  Prevotella 9 0.0009573787 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae
5  Prevotella 9 0.0009573787 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae
6  Prevotella 9 0.0009573787 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae
7  Anaerovibrio 0.0049035545 Bacteria    Firmicutes Negativicutes Selenomonadales  Veillonellaceae
8  Anaerovibrio 0.0049035545 Bacteria    Firmicutes Negativicutes Selenomonadales  Veillonellaceae
9     Roseburia 0.0026672558 Bacteria    Firmicutes    Clostridia   Clostridiales  Lachnospiraceae

I don't know why every row in x gets duplicated according to the number of repetitions in y. My desired output should have the same number of rows as df_gen, with the columns from df_tax added.

I also tried dplyr::left_join and ended up with the same problem.

I checked other posts on the internet but found nothing that solves this issue... Any clues?

The function is working as expected: it merges every matching row of df_tax onto df_gen, and since several rows in df_tax match a single value in df_gen, you get multiple rows. df_tax has duplicated rows; that is the issue.
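To see this in numbers, here is a minimal sketch that counts how many df_tax rows match each Genus in df_gen and predicts the size of the merged result. The data frames are a trimmed copy of the question's data (only Family and Genus are kept from df_tax), and the names n_matches and expected_rows are made up for illustration.

```r
# Trimmed copy of the question's data (assumption: only Family and Genus kept)
df_gen <- data.frame(
  Genus   = c("Unclassified", "Lactobacillus", "Prevotella9",
              "Anaerovibrio", "Roseburia"),
  mean_RA = c(0.1357401738, 0.0003825068, 0.0009573787,
              0.0049035545, 0.0026672558)
)
df_tax <- data.frame(
  Family = c("Prevotellaceae", "Prevotellaceae", "Prevotellaceae",
             "Lactobacillaceae", "Veillonellaceae", "Veillonellaceae",
             "Lactobacillaceae", "Lachnospiraceae"),
  Genus  = c("Prevotella9", "Prevotella9", "Prevotella9", "Lactobacillus",
             "Anaerovibrio", "Anaerovibrio", "Lactobacillus", "Roseburia")
)

# Number of rows in df_tax that match each key of df_gen (0 if absent)
n_matches <- vapply(df_gen$Genus,
                    function(g) sum(df_tax$Genus == g),
                    integer(1))

# A left join emits one row per match, and one row for each unmatched key
expected_rows <- sum(pmax(n_matches, 1L))

merged <- merge(df_gen, df_tax, by = "Genus", all.x = TRUE)
nrow(merged) == expected_rows
```

Any key with n_matches greater than 1 is a row of df_gen that will be repeated in the output.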

Both merge(x, y, all.x=TRUE) and left_join(x, y) keep all rows from x whether or not they have a match in y. In other words, these commands prevent non-matching rows in x from being discarded, but they do not prevent multiple matches. If y has duplicates on the key variable (in your case, "Genus") and they have a match in x, you will get duplicates. From a plain logic point of view, that makes sense: which of the two duplicated rows in y should be matched? The function has no way to know, so it matches both.

If you want a data frame with the same number of rows as df_gen, df_tax must have no duplicates on Genus. If the rows with a duplicated Genus are also identical in the other variables, you can follow the comment by r.user.05apr: df_tax_unique <- df_tax[!duplicated(df_tax$Genus), ]. This keeps only the first of each set of duplicated rows. If rows share the same Genus but differ in the other variables, you need to decide according to your needs: you can augment df_gen (accept the extra rows), or you can delete from df_tax the rows you don't want to add to df_gen.
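The de-duplicate-then-join approach could look like the sketch below. It uses a trimmed copy of the question's data (only Family and Genus kept from df_tax); the names df_tax_distinct and conflicting are made up for illustration. It first drops fully identical rows, then checks that no Genus is left with conflicting taxonomy before merging.

```r
# Trimmed copy of the question's data (assumption: only Family and Genus kept)
df_gen <- data.frame(
  Genus   = c("Unclassified", "Lactobacillus", "Prevotella9",
              "Anaerovibrio", "Roseburia"),
  mean_RA = c(0.1357401738, 0.0003825068, 0.0009573787,
              0.0049035545, 0.0026672558)
)
df_tax <- data.frame(
  Family = c("Prevotellaceae", "Prevotellaceae", "Prevotellaceae",
             "Lactobacillaceae", "Veillonellaceae", "Veillonellaceae",
             "Lactobacillaceae", "Lachnospiraceae"),
  Genus  = c("Prevotella9", "Prevotella9", "Prevotella9", "Lactobacillus",
             "Anaerovibrio", "Anaerovibrio", "Lactobacillus", "Roseburia")
)

# Step 1: drop rows that are identical in every column
df_tax_distinct <- unique(df_tax)

# Step 2: a Genus that still appears twice has copies that disagree in
# some other column, which calls for a manual decision
conflicting <- df_tax_distinct$Genus[duplicated(df_tax_distinct$Genus)]
stopifnot(length(conflicting) == 0)

# Step 3: left join against the de-duplicated lookup table
merged <- merge(df_gen, df_tax_distinct, by = "Genus", all.x = TRUE)

# merge() sorts by the key by default; restore the row order of df_gen
merged <- merged[match(df_gen$Genus, merged$Genus), ]
rownames(merged) <- NULL
merged
```

With dplyr, the same idea would be distinct(df_tax, Genus, .keep_all = TRUE) followed by left_join(df_gen, ..., by = "Genus"); note that this variant keeps the first row per Genus without checking whether the copies agree.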

As a solution to your problem, you could write a function that matches on rows, similar to this one:

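matchRows <- function(df1, df2, by) {
  # match() returns the position of the FIRST matching key in df2 (NA if none),
  # so each row of df1 is paired with at most one row of df2.
  # Indexing whole columns (rather than apply() over rows) keeps column types.
  m <- match(df1[[by]], df2[[by]])
  out <- cbind(df1, df2[m, names(df2) != by, drop = FALSE])
  rownames(out) <- NULL  # discard row names inherited from df2
  out
}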

matchRows(df1=df_gen, df2=df_tax, by="Genus")
#           Genus      mean_RA  Kingdom        Phylum         Class           Order           Family
# 1  Unclassified 0.1357401738     <NA>          <NA>          <NA>            <NA>             <NA>
# 2 Lactobacillus 0.0003825068 Bacteria    Firmicutes       Bacilli Lactobacillales Lactobacillaceae
# 3   Prevotella9 0.0009573787 Bacteria Bacteroidetes   Bacteroidia   Bacteroidales   Prevotellaceae
# 4  Anaerovibrio 0.0049035545 Bacteria    Firmicutes Negativicutes Selenomonadales  Veillonellaceae
# 5     Roseburia 0.0026672558 Bacteria    Firmicutes    Clostridia   Clostridiales  Lachnospiraceae

Data:

df_gen <- structure(list(Genus = c("Unclassified", "Lactobacillus", "Prevotella9", 
"Anaerovibrio", "Roseburia"), mean_RA = c(0.1357401738, 0.0003825068, 
0.0009573787, 0.0049035545, 0.0026672558)), row.names = c(NA, 
-5L), class = "data.frame")

df_tax <- structure(list(Kingdom = c("Bacteria", "Bacteria", "Bacteria", 
"Bacteria", "Bacteria", "Bacteria", "Bacteria", "Bacteria"), 
    Phylum = c("Bacteroidetes", "Bacteroidetes", "Bacteroidetes", 
    "Firmicutes", "Firmicutes", "Firmicutes", "Firmicutes", "Firmicutes"
    ), Class = c("Bacteroidia", "Bacteroidia", "Bacteroidia", 
    "Bacilli", "Negativicutes", "Negativicutes", "Bacilli", "Clostridia"
    ), Order = c("Bacteroidales", "Bacteroidales", "Bacteroidales", 
    "Lactobacillales", "Selenomonadales", "Selenomonadales", 
    "Lactobacillales", "Clostridiales"), Family = c("Prevotellaceae", 
    "Prevotellaceae", "Prevotellaceae", "Lactobacillaceae", "Veillonellaceae", 
    "Veillonellaceae", "Lactobacillaceae", "Lachnospiraceae"), 
    Genus = c("Prevotella9", "Prevotella9", "Prevotella9", "Lactobacillus", 
    "Anaerovibrio", "Anaerovibrio", "Lactobacillus", "Roseburia"
    )), row.names = c(NA, -8L), class = "data.frame")
