R merge and left_join outputs duplicated rows
I have two data frames with this structure:
> df_gen[1:5,]
Genus mean_RA
1 Unclassified 0.1357401738
2 Lactobacillus 0.0003825068
3 Prevotella 9 0.0009573787
4 Anaerovibrio 0.0049035545
5 Roseburia 0.0026672558
> df_tax[1:8,]
Kingdom Phylum Class Order Family Genus
1 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae Prevotella 9
2 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae Prevotella 9
3 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae Prevotella 9
4 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
5 Bacteria Firmicutes Negativicutes Selenomonadales Veillonellaceae Anaerovibrio
6 Bacteria Firmicutes Negativicutes Selenomonadales Veillonellaceae Anaerovibrio
7 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Lactobacillus
8 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae Roseburia
I want to merge df_gen with df_tax, but when I do, every row is duplicated, like this:
> merge(df_gen, df_tax, by = "Genus", all.x = TRUE)
Genus mean_RA Kingdom Phylum Class Order Family
1 Unclassified 0.1357401738 NA NA NA NA NA
2 Lactobacillus 0.0003825068 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae
3 Lactobacillus 0.0003825068 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae
4 Prevotella 9 0.0009573787 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae
5 Prevotella 9 0.0009573787 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae
6 Prevotella 9 0.0009573787 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae
7 Anaerovibrio 0.0049035545 Bacteria Firmicutes Negativicutes Selenomonadales Veillonellaceae
8 Anaerovibrio 0.0049035545 Bacteria Firmicutes Negativicutes Selenomonadales Veillonellaceae
9 Roseburia 0.0026672558 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae
I don't know why everything in x is getting duplicated according to the number of repetitions in y. My desired output should have the same number of rows as df_gen, with the columns from df_tax added.
I also tried dplyr::left_join and ended up with the same problem.
I checked other posts on the internet but found nothing that solves this issue. Any clues?
The function is working as expected: it joins the rows of df_tax onto each row of df_gen, and since several rows in df_tax match a single value in df_gen, you get several rows. df_tax has duplicated rows; that is the issue.
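To see this before joining, one option is to count how often each key appears in the lookup table. The following is only a minimal sketch on a trimmed toy copy of the data (values abbreviated, column set reduced to Family and Genus); the names dup_keys and n_matches are made up for illustration.

```r
# Toy data mirroring the structure in the question (abbreviated).
df_gen <- data.frame(
  Genus   = c("Unclassified", "Lactobacillus", "Roseburia"),
  mean_RA = c(0.1357, 0.0004, 0.0027),
  stringsAsFactors = FALSE
)
df_tax <- data.frame(
  Family = c("Lactobacillaceae", "Lactobacillaceae", "Lachnospiraceae"),
  Genus  = c("Lactobacillus", "Lactobacillus", "Roseburia"),
  stringsAsFactors = FALSE
)

# Number of rows per key value in the lookup table.
table(df_tax$Genus)

# Key values that occur more than once in df_tax.
dup_keys <- unique(df_tax$Genus[duplicated(df_tax$Genus)])
dup_keys

# For each row of df_gen, how many rows of df_tax share its key.
n_matches <- vapply(df_gen$Genus,
                    function(g) sum(df_tax$Genus == g),
                    integer(1))

# A left join yields one row per match, and one row for each unmatched key.
expected_rows <- sum(pmax(n_matches, 1L))
expected_rows
```

If expected_rows is larger than nrow(df_gen), the join will add rows, and dup_keys shows which keys are responsible.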
Both merge(x, y, all.x = TRUE) and left_join(x, y) keep all rows from x whether or not they have a match in y. In other words, these commands prevent non-matching rows in x from being discarded, but they do not prevent multiple matches. If y has duplicates on the key variable (in your case, "Genus") and they have a match in x, you will get duplicated rows. From a purely logical point of view, that makes sense: which of the two duplicated rows in y should be matched? The function has no way to know, so it matches both.
If you want a result with the same number of rows as df_gen, df_tax must have no duplicated keys. If the rows with a duplicated Genus are also identical in all the other variables, you can follow the comment by r.user.05apr:
df_tax_unique <- df_tax[!duplicated(df_tax$Genus), ]
This keeps only the first of each set of duplicated rows. If rows share the same Genus but differ in the other variables, you need to decide according to your needs: you can either augment df_gen or delete from df_tax the rows you don't want to add to df_gen.
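The two cases above can be told apart programmatically. This is a rough sketch on a trimmed toy copy of the data, not the only way to do it; df_tax_distinct, conflicts and merged are illustrative names. It drops fully identical rows first, then checks whether any Genus is still repeated, which would mean its rows disagree in some other column.

```r
# Toy data mirroring the structure in the question (abbreviated).
df_gen <- data.frame(
  Genus   = c("Unclassified", "Lactobacillus", "Roseburia"),
  mean_RA = c(0.1357, 0.0004, 0.0027),
  stringsAsFactors = FALSE
)
df_tax <- data.frame(
  Family = c("Lactobacillaceae", "Lactobacillaceae", "Lachnospiraceae"),
  Genus  = c("Lactobacillus", "Lactobacillus", "Roseburia"),
  stringsAsFactors = FALSE
)

# Step 1: drop rows that are identical in every column.
df_tax_distinct <- unique(df_tax)

# Step 2: a Genus that is still repeated has conflicting values elsewhere.
conflicts <- df_tax_distinct$Genus[duplicated(df_tax_distinct$Genus)]

if (length(conflicts) > 0) {
  # Deciding which row to keep is a judgement call, so stop instead of guessing.
  stop("Conflicting taxonomy for: ", paste(unique(conflicts), collapse = ", "))
}

# Step 3: the key is now unique, so the left join cannot add rows.
merged <- merge(df_gen, df_tax_distinct, by = "Genus", all.x = TRUE)
merged
```

Note that merge() sorts the result by the key by default, so the row order may differ from df_gen even though the row count matches.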
As a solution to your problem, you could write a function that matches row by row, similar to this one:
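matchRows <- function(df1, df2, by) {
  # apply() hands each row of df1 over as a character vector,
  # so numeric columns such as mean_RA come back as character.
  do.call(rbind, apply(df1, 1, function(x) {
    # match() returns the index of the FIRST matching row in df2 (NA if none)
    m <- match(x[[by]], df2[[by]])
    # bind the df1 row to the matched df2 row (minus the key) and reset row names
    `rownames<-`(cbind(t(x), df2[m, -which(names(df2) == by)]), NULL)
  }))
}
matchRows(df1 = df_gen, df2 = df_tax, by = "Genus")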
# Genus mean_RA Kingdom Phylum Class Order Family
# 1 Unclassified 0.1357401738 <NA> <NA> <NA> <NA> <NA>
# 2 Lactobacillus 0.0003825068 Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae
# 3 Prevotella9 0.0009573787 Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae
# 4 Anaerovibrio 0.0049035545 Bacteria Firmicutes Negativicutes Selenomonadales Veillonellaceae
# 5 Roseburia 0.0026672558 Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae
Data:
df_gen <- structure(list(Genus = c("Unclassified", "Lactobacillus", "Prevotella9",
"Anaerovibrio", "Roseburia"), mean_RA = c(0.1357401738, 0.0003825068,
0.0009573787, 0.0049035545, 0.0026672558)), row.names = c(NA,
-5L), class = "data.frame")
df_tax <- structure(list(Kingdom = c("Bacteria", "Bacteria", "Bacteria",
"Bacteria", "Bacteria", "Bacteria", "Bacteria", "Bacteria"),
Phylum = c("Bacteroidetes", "Bacteroidetes", "Bacteroidetes",
"Firmicutes", "Firmicutes", "Firmicutes", "Firmicutes", "Firmicutes"
), Class = c("Bacteroidia", "Bacteroidia", "Bacteroidia",
"Bacilli", "Negativicutes", "Negativicutes", "Bacilli", "Clostridia"
), Order = c("Bacteroidales", "Bacteroidales", "Bacteroidales",
"Lactobacillales", "Selenomonadales", "Selenomonadales",
"Lactobacillales", "Clostridiales"), Family = c("Prevotellaceae",
"Prevotellaceae", "Prevotellaceae", "Lactobacillaceae", "Veillonellaceae",
"Veillonellaceae", "Lactobacillaceae", "Lachnospiraceae"),
Genus = c("Prevotella9", "Prevotella9", "Prevotella9", "Lactobacillus",
"Anaerovibrio", "Anaerovibrio", "Lactobacillus", "Roseburia"
)), row.names = c(NA, -8L), class = "data.frame")
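Because match() is vectorised, the same first-match lookup can also be written without apply(), which keeps the original column types (mean_RA stays numeric). This is a sketch of that variant on a trimmed toy copy of the data, not the answerer's function; idx, extra and result are illustrative names.

```r
# Toy data mirroring the structure above (abbreviated).
df_gen <- data.frame(
  Genus   = c("Unclassified", "Lactobacillus", "Roseburia"),
  mean_RA = c(0.1357, 0.0004, 0.0027),
  stringsAsFactors = FALSE
)
df_tax <- data.frame(
  Family = c("Lactobacillaceae", "Lactobacillaceae", "Lachnospiraceae"),
  Genus  = c("Lactobacillus", "Lactobacillus", "Roseburia"),
  stringsAsFactors = FALSE
)

by <- "Genus"

# Index of the first matching row in df_tax for every row of df_gen (NA if none).
idx <- match(df_gen[[by]], df_tax[[by]])

# Pull the non-key columns for those rows; an NA index gives a row of NAs.
extra <- df_tax[idx, setdiff(names(df_tax), by), drop = FALSE]

# Bind side by side; row order and row count follow df_gen.
result <- cbind(df_gen, extra)
rownames(result) <- NULL
result
```

As with !duplicated(), only the first matching row of df_tax is used, so this is only appropriate when the repeated rows agree with each other.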