
join and merge do not return the correct number of rows in R

I have two data frames that share a common column (named sys_loc_code). The first data frame (df1) has 1033 rows. The second data frame (df2) has 2751 rows.

I would like to combine df1 and df2 to get a new data frame that contains all columns found in df1 and df2 but keeps only the rows from df1.

I have tried join, left_join, and inner_join (from dplyr) and a simple merge. Each of these returns 2057 rows, but I think the result should have only 1033 rows, to match df1. How do I return only the rows from df1?

I cannot share the datasets that caused this problem. However, after a bit of consultation, I can recreate the problem with this minimal example:

library(dplyr)

df1 <-
  data.frame(
    sys_loc_code = c("A", "B", "C")
    , df1Val = 1
  )


df2 <-
  data.frame(
    sys_loc_code = c("A", "B", "B", "C", "D")
    , df2Val = c(1, 1, 2, 1, 1)
  )

left_join(df1, df2)

This returns 4 rows, while df1 has only 3.

The most likely issue is that df2$sys_loc_code contains multiple entries for some of the values in df1$sys_loc_code.

df1$sys_loc_code has only 3 values, but one of them ("B") appears twice in df2$sys_loc_code, so each of those joins returns 4 rows. For example,

left_join(df1, df2)

gives

  sys_loc_code df1Val df2Val
1            A      1      1
2            B      1      1
3            B      1      2
4            C      1      1
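
To check whether this is what is happening in the real data, one option is to count how often each key appears in df2 and to work out how many rows a left join should produce. This is only a sketch on the example data above; it assumes dplyr is installed, and the names dupKeys, matchesPerRow, and expectedRows are made up for illustration.

```r
library(dplyr)

df1 <- data.frame(sys_loc_code = c("A", "B", "C"), df1Val = 1)
df2 <- data.frame(
  sys_loc_code = c("A", "B", "B", "C", "D"),
  df2Val = c(1, 1, 2, 1, 1)
)

# Keys that occur more than once in df2 (dupKeys is an illustrative name)
dupKeys <- df2 %>%
  count(sys_loc_code) %>%
  filter(n > 1)

dupKeys

# Number of df2 rows matching each df1 row
df1Keys <- as.character(df1$sys_loc_code)
df2Keys <- as.character(df2$sys_loc_code)
matchesPerRow <- sapply(df1Keys, function(k) sum(df2Keys == k))

# A left join keeps each df1 row once per match, or once (with NA) if unmatched
expectedRows <- sum(pmax(matchesPerRow, 1))
expectedRows
```

If expectedRows on the real data comes out as 2057, duplicated keys in df2 would account for the extra rows.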

So, the short answer to your question may be that the results actually are "correct" for the code you wrote. If you want something different to happen (e.g., only one entry from df2 per match), you will need to decide exactly what output you want.
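
If you would rather have the join fail loudly than silently add rows, recent versions of dplyr let you state the relationship you expect between the two tables. The sketch below assumes dplyr 1.1.0 or later, where left_join() accepts a relationship argument; the tryCatch wrapper and the name result are only there for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the relationship argument

df1 <- data.frame(sys_loc_code = c("A", "B", "C"), df1Val = 1)
df2 <- data.frame(
  sys_loc_code = c("A", "B", "B", "C", "D"),
  df2Val = c(1, 1, 2, 1, 1)
)

# "many-to-one" asserts that each df1 row matches at most one df2 row;
# the join raises an error when that does not hold
result <- tryCatch(
  left_join(df1, df2, by = "sys_loc_code", relationship = "many-to-one"),
  error = function(e) {
    message("Join stopped: ", conditionMessage(e))
    NULL
  }
)

# TRUE here, because "B" matches two rows in df2
is.null(result)
```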

For example, if you want the first entry from df2:

left_join(
  df1
  , df2 %>%
    group_by(sys_loc_code) %>%
    slice(1)
)

gives

  sys_loc_code df1Val df2Val
1            A      1      1
2            B      1      1
3            C      1      1

Or, if you want the mean of the df2 values for each sys_loc_code:


left_join(
  df1
  , df2 %>%
    group_by(sys_loc_code) %>%
    summarise(df2Val = mean(df2Val))
)

gives

  sys_loc_code df1Val df2Val
1            A      1    1.0
2            B      1    1.5
3            C      1    1.0

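and, to keep the entry with the largest value of another variable (here a row index, so the last df2 row for each key):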

left_join(
  df1
  , df2 %>%
    mutate(aVarToSortOn = 1:n()) %>%
    group_by(sys_loc_code) %>%
    slice(which.max(aVarToSortOn))
)

gives

  sys_loc_code df1Val df2Val aVarToSortOn
1            A      1      1            1
2            B      1      2            3
3            C      1      1            4

If you know you have unique values in a column, you could also use filter to select which match to keep from df2.
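
As a rough sketch of that last option on the example data: the rule df2Val == 1 is a hypothetical condition chosen only because, here, each key has at most one row that satisfies it. The names df2Keep and joined are likewise made up for illustration.

```r
library(dplyr)

df1 <- data.frame(sys_loc_code = c("A", "B", "C"), df1Val = 1)
df2 <- data.frame(
  sys_loc_code = c("A", "B", "B", "C", "D"),
  df2Val = c(1, 1, 2, 1, 1)
)

# Hypothetical rule: keep only the df2 rows where df2Val is 1,
# which leaves at most one row per sys_loc_code in this example
df2Keep <- df2 %>%
  filter(df2Val == 1)

joined <- left_join(df1, df2Keep, by = "sys_loc_code")
joined

# The join now has the same number of rows as df1
nrow(joined) == nrow(df1)
```

This only helps if the filter condition really does leave one row per key; otherwise the duplicates come back.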
