
Incorrect output from inner_join of dplyr package

I have two datasets, named "results" and "support2", available here.

I want to merge the two datasets by their only common column, "SNP". Code below:

> library(readr)
> library(dplyr)
> results <- read_delim("<path>\\results", delim = "\t", col_names = TRUE)
> support2 <- read_delim("<path>\\support2", delim = "\t", col_names = TRUE)

> head(results)
# A tibble: 6 x 2
  SNP        p.value
  <chr>        <dbl>
1 rs28436661   0.334
2 rs9922067    0.322
3 rs2562132    0.848
4 rs3930588    0.332
5 rs2562137    0.323
6 rs3848343    0.363

> head(support2)
# A tibble: 6 x 2
  SNP         position
  <chr>          <dbl>
1 rs62028702     60054
2 rs190434815    60085
3 rs62028703     60087
4 rs62028704     60095
5 rs181534180    60164
6 rs186233776    60177

> dim(results)
[1] 188242      2
> dim(support2)
[1] 1210619       2

# determine the number of common SNPs
length(Reduce(intersect, list(results$SNP, support2$SNP)))
[1] 187613

I would expect that after inner_join, the new data would have 187613 rows.

> newdata <- inner_join(results, support2)
Joining, by = "SNP"
> dim(newdata)
[1] 1409812       3

Strangely, instead of having 187613 rows, the new data has 1409812 rows, which is even larger than the sum of the row counts of the two data frames.

I then switched to the merge function, as below:

> newdata2 <- merge(results, support2)
> dim(newdata2)
[1] 1409812       3

This second data frame has the same issue, and I have no idea why.

How can I obtain a new data frame whose rows are the common rows of the two data frames (it should have 187613 rows) and whose columns are the columns of both data frames?

It could be a result of duplicate elements:

results <- data.frame(col1 = rep(letters[1:3], each = 3), col2 = rnorm(9))
support2 <- data.frame(col1 = rep(letters[1:5],each = 2), newcol = runif(10))

library(dplyr)
out <- inner_join(results, support2)
nrow(out)
#[1] 18

Here, the values in the common column ('col1') are duplicated in both datasets. The join does not pick a single match; it pairs every row for a key in one table with every row for the same key in the other. The result resembles a cross join within each key, but it is not a full cross join of the two tables.
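The inflated row count can be predicted from the per-key counts: each key present in both tables contributes (its number of rows in the first table) times (its number of rows in the second). A minimal sketch of that check, reusing the toy data above but with fixed values instead of random ones (an assumption made only to keep the example reproducible); the names n_x, n_y and per_key are illustrative:

```r
library(dplyr)

# Toy data with duplicated keys in both tables (fixed values, not random)
results  <- data.frame(col1 = rep(letters[1:3], each = 3), col2 = 1:9)
support2 <- data.frame(col1 = rep(letters[1:5], each = 2), newcol = 1:10)

# How often does each key occur in each table?
n_x <- count(results,  col1, name = "n_x")
n_y <- count(support2, col1, name = "n_y")

# Keys are unique in n_x and n_y, so this join is one-to-one
per_key <- inner_join(n_x, n_y, by = "col1")

# Predicted size of inner_join(results, support2): sum of n_x * n_y per key
expected_rows <- sum(per_key$n_x * per_key$n_y)

# Actual size of the join on the duplicated keys
actual_rows <- nrow(inner_join(results, support2, by = "col1"))

expected_rows
actual_rows
```

As far as I know, recent dplyr releases (1.1.0 and later) also emit a warning about a many-to-many relationship in this situation, which is a useful hint that the keys are not unique on either side.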

As already pointed out by @akrun, the data may contain duplicates; that is probably the only explanation for this behavior.

According to its documentation, intersect always returns unique values, whereas an inner join returns repeated keys whenever the "by" column contains duplicates. Hence the count mismatch.
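The difference is easy to reproduce with base R alone. The sketch below uses two small made-up key vectors (x and y are hypothetical, not the asker's data) and compares the size of the intersection with the size of a merge on the same keys:

```r
# Hypothetical key vectors with repeated values
x <- c("a", "a", "b", "c")
y <- c("a", "b", "b", "d")

# intersect() discards duplicates, so each common key is counted once
common <- intersect(x, y)
length(common)

# merge() keeps every matching combination of rows
joined <- merge(data.frame(key = x), data.frame(key = y), by = "key")
nrow(joined)

# The number of distinct keys in the merged result agrees with intersect()
length(unique(joined$key))
```

So the intersect count answers "how many keys are shared", while the join row count answers "how many row pairs share a key"; the two only coincide when the key is unique in both tables.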

To verify this, count the unique values of the "by" variable (SNP in your case) in the joined result; it should match your intersect result. That alone does not mean your join/merge is right: ideally, a join whose key is duplicated in both table A and table B is not recommended (unless, of course, you have a business or other justification). So check whether the duplicates are present in both tables or in only one of them. If they are found in only one of the tables, your merge/join is probably fine. I hope this explains the scenario.
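One way to run that check, and to get one row per SNP if that is what is wanted, is sketched below. The two small tables are hypothetical stand-ins for "results" and "support2", and keeping the first row per SNP is an assumption; the duplicated rows should be inspected before deciding which ones to keep.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
results  <- data.frame(SNP = c("rs1", "rs1", "rs2", "rs3"),
                       p.value = c(0.10, 0.20, 0.30, 0.40))
support2 <- data.frame(SNP = c("rs1", "rs2", "rs2", "rs4"),
                       position = c(100, 200, 201, 400))

# Keys that occur more than once in each table
dup_results  <- results  %>% count(SNP) %>% filter(n > 1)
dup_support2 <- support2 %>% count(SNP) %>% filter(n > 1)

# One possible fix: keep the first row per SNP before joining.
# Which row to keep is an assumption, not something the data decides for you.
results_u  <- distinct(results,  SNP, .keep_all = TRUE)
support2_u <- distinct(support2, SNP, .keep_all = TRUE)

newdata <- inner_join(results_u, support2_u, by = "SNP")

# With unique keys on both sides, the row count equals the intersect count
nrow(newdata)
length(intersect(results$SNP, support2$SNP))
```

If dup_results and dup_support2 are both non-empty, the join is many-to-many and the row count will exceed the number of common SNPs, which matches the behavior described in the question.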

Please let me know if this doesn't answer your question, and I will remove it.

From the documentation:

intersect:

Each of union, intersect, setdiff and setequal will discard any duplicated values in the arguments, and they apply as.vector to their arguments.

inner_join():

return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
