join and merge do not return the correct number of rows in R
I have two dataframes which share a common column (named sys_loc_code). The first dataframe (df1) has 1033 rows. The second dataframe (df2) has 2751 rows.
I would like to combine df1 and df2 to get a new dataframe with all columns found in df1 and df2, keeping only rows from df1.
I have tried join, left_join, and inner_join (from dplyr) and a simple merge. Each of these returns 2057 rows, and I think it should only be returning 1033 to match what is in df1. How do I return only rows from df1?
I cannot share the datasets that caused this problem. However, after a bit of consultation, I can recreate the problem with this minimal example:
library(dplyr)

df1 <-
  data.frame(
    sys_loc_code = c("A", "B", "C")
    , df1Val = 1
  )
df2 <-
  data.frame(
    sys_loc_code = c("A", "B", "B", "C", "D")
    , df2Val = c(1, 1, 2, 1, 1)
  )
left_join(df1, df2)
This returns 4 rows, while df1 has only 3 rows.
The main issue is that df2$sys_loc_code contains multiple entries for some of the values in df1$sys_loc_code. df1$sys_loc_code has only 3 values, but one of them ("B") is present twice in df2$sys_loc_code, so each of those joins returns one row per match, 4 rows in total. For example,
left_join(df1, df2)
gives
  sys_loc_code df1Val df2Val
1            A      1      1
2            B      1      1
3            B      1      2
4            C      1      1
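To confirm that duplicated keys explain the extra rows, you can count how often each key appears in df2 and predict the join size before running it. The following is a minimal base R sketch on the toy data above; the helper names (key_counts, dup_keys, matches_per_row, expected_rows) are made up for illustration and are not part of the original question.

```r
# Toy data from the question (stringsAsFactors = FALSE keeps keys as character)
df1 <- data.frame(sys_loc_code = c("A", "B", "C"), df1Val = 1,
                  stringsAsFactors = FALSE)
df2 <- data.frame(sys_loc_code = c("A", "B", "B", "C", "D"),
                  df2Val = c(1, 1, 2, 1, 1),
                  stringsAsFactors = FALSE)

# How many times does each key occur in df2?
key_counts <- table(df2$sys_loc_code)

# Keys that occur more than once are the ones that multiply rows in a join
dup_keys <- names(key_counts)[key_counts > 1]
print(dup_keys)

# For each row of df1, count its matches in df2
matches_per_row <- vapply(
  df1$sys_loc_code,
  function(k) sum(df2$sys_loc_code == k),
  integer(1)
)

# A left join keeps every df1 row once per match (or once if unmatched)
expected_rows <- sum(pmax(matches_per_row, 1L))
print(expected_rows)
```

On the real data, the same check should show which sys_loc_code values account for the difference between 1033 and 2057 rows.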
So, the short answer to your question may be that the results actually are "correct" based on the code you are writing. If you want something different to happen (e.g., only one entry from df2 per match), you will likely need to decide exactly what output you want.
For example, if you want the first entry from df2:
left_join(
df1
, df2 %>%
group_by(sys_loc_code) %>%
slice(1)
)
gives
  sys_loc_code df1Val df2Val
1            A      1      1
2            B      1      1
3            C      1      1
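As an alternative way to keep only the first df2 row per key, dplyr's distinct() with .keep_all = TRUE should give the same result as the group_by() plus slice(1) approach. This is a sketch on the toy data, with the join column spelled out via by; the names df2_first and result are only illustrative.

```r
library(dplyr)

df1 <- data.frame(sys_loc_code = c("A", "B", "C"), df1Val = 1,
                  stringsAsFactors = FALSE)
df2 <- data.frame(sys_loc_code = c("A", "B", "B", "C", "D"),
                  df2Val = c(1, 1, 2, 1, 1),
                  stringsAsFactors = FALSE)

# Keep the first row for each sys_loc_code, retaining all other columns
df2_first <- distinct(df2, sys_loc_code, .keep_all = TRUE)

# With one row per key in df2_first, the left join cannot add rows to df1
result <- left_join(df1, df2_first, by = "sys_loc_code")
print(result)
```

Recent dplyr releases (1.1.0 and later, as far as I know) also accept a multiple = "first" argument in left_join(), which may be worth checking in the documentation for your installed version.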
Or, if you want the mean of the matching df2 entries:
left_join(
df1
, df2 %>%
group_by(sys_loc_code) %>%
summarise(df2Val = mean(df2Val))
)
gives
  sys_loc_code df1Val df2Val
1            A      1    1.0
2            B      1    1.5
3            C      1    1.0
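And, to keep the entry with the largest value of some sorting variable (here aVarToSortOn, which is just the row number in df2):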
left_join(
df1
, df2 %>%
mutate(aVarToSortOn = 1:n()) %>%
group_by(sys_loc_code) %>%
slice(which.max(aVarToSortOn))
)
gives
  sys_loc_code df1Val df2Val aVarToSortOn
1            A      1      2            3
If you know you have unique values in a column, you could also use filter to select which match to keep from df2.
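A sketch of the filter approach on the toy data: here the condition df2Val == 1 happens to leave exactly one row per key, which is an assumption about this example rather than something that will hold for your real data, so the code checks it before joining. The name df2_keep is illustrative.

```r
library(dplyr)

df1 <- data.frame(sys_loc_code = c("A", "B", "C"), df1Val = 1,
                  stringsAsFactors = FALSE)
df2 <- data.frame(sys_loc_code = c("A", "B", "B", "C", "D"),
                  df2Val = c(1, 1, 2, 1, 1),
                  stringsAsFactors = FALSE)

# Keep only the df2 rows you want to match (assumed condition: df2Val == 1)
df2_keep <- filter(df2, df2Val == 1)

# Guard: the filtered table must have at most one row per key
stopifnot(anyDuplicated(df2_keep$sys_loc_code) == 0)

result <- left_join(df1, df2_keep, by = "sys_loc_code")
print(result)
```

If the guard fails on your data, the filter condition does not identify a single row per sys_loc_code and one of the other approaches is probably a better fit.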