Using full join to match values in two different dataframes

I have two dataframes. The first consists of four columns: 1) ID, 2) Sites, 3) Depth, and 4) Density. The second consists of three columns: 1) ID, 2) Sites, and 3) Choice.

df1

  ID Sites Depth Density
  1     B   0.2       0
  2     B   0.2       1
  3     D   0.3       0
  4     D   0.3       1
  5     B   0.2       2

df2

  ID Sites Choice 
  1     A    No
  1     B    Yes     
  1     C    No
  1     D    No
  2     A    No
  2     B    Yes
  2     C    No
  2     D    No
  3     A    No
  3     B    No
  3     C    No
  3     D    Yes
  4     A    No
  4     B    No
  4     C    No
  4     D    Yes
  5     A    No
  5     B    Yes
  5     C    No
  5     D    No

What I am trying to do is add a column to df2 that holds the density for each site where the ID has a "Yes". Below is the output I want:

Desired Output

  ID Sites Choice Depth  Density
  1     A    No     0.1     0
  1     B    Yes    0.2     0 
  1     C    No     0.3     0 
  1     D    No     0.4     0
  2     A    No     0.1     0
  2     B    Yes    0.2     1
  2     C    No     0.3     0
  2     D    No     0.4     0
  3     A    No     0.1     0
  3     B    No     0.2     1
  3     C    No     0.3     0
  3     D    Yes    0.4     0
  4     A    No     0.1     0
  4     B    No     0.2     1
  4     C    No     0.3     0
  4     D    Yes    0.4     1
  5     A    No     0.1     0
  5     B    Yes    0.2     2
  5     C    No     0.3     0
  5     D    No     0.4     1

I've tried the following, but it doesn't work:

     df3<-df2 %>%
     full_join(df1, by = c("ID", "Sites")) %>%
     group_by(ID) %>%
     mutate(Density = Density[Choice == "Yes"]) %>%
     distinct(ID, Sites, .keep_all = TRUE) 

Thank you for your help, Stack Overflow community.

Is this what you are looking for?

df1<-data.frame(stringsAsFactors=FALSE,
                ID = c(1, 2, 3, 4, 5),
                Sites = c("B", "B", "D", "D", "B"),
                Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
                Density = c(0, 1, 0, 1, 2)
)

df2<-data.frame(stringsAsFactors=FALSE,
                ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5),
                Sites = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D",
                          "A", "B", "C", "D", "A", "B", "C", "D"),
                Choice = c("No", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "No",
                           "No", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No",
                           "No")
)

library(dplyr)
# Keep all rows of df2; Density is 0 wherever Choice is not "Yes"
df3 <- df2 %>%
  left_join(df1, by = c("ID", "Sites")) %>%
  mutate(Density = if_else(Choice == "Yes", Density, 0))
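To check what the left join above returns, here is a minimal self-contained sketch. It assumes dplyr is installed and rebuilds the data from the question in a more compact form. Note that Depth stays NA for rows of df2 that have no matching ID and Sites pair in df1, because df1 holds no depth for those sites.

```r
library(dplyr)

# Data rebuilt from the question
df1 <- data.frame(stringsAsFactors = FALSE,
                  ID = 1:5,
                  Sites = c("B", "B", "D", "D", "B"),
                  Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
                  Density = c(0, 1, 0, 1, 2))

df2 <- data.frame(stringsAsFactors = FALSE,
                  ID = rep(1:5, each = 4),
                  Sites = rep(c("A", "B", "C", "D"), times = 5),
                  Choice = c("No", "Yes", "No", "No",  "No", "Yes", "No", "No",
                             "No", "No", "No", "Yes",  "No", "No", "No", "Yes",
                             "No", "Yes", "No", "No"))

# Every row of df2 is kept; df1 columns are filled in only where ID and Sites match
df3 <- df2 %>%
  left_join(df1, by = c("ID", "Sites")) %>%
  mutate(Density = if_else(Choice == "Yes", Density, 0))

nrow(df3)              # same number of rows as df2
sum(is.na(df3$Depth))  # rows with no match in df1 keep Depth = NA
```

If the unmatched Depth values should be filled in too, they would have to come from a per-site lookup that is not part of the data shown in the question.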

Data.

df1 <- data.frame(
  ID = seq(from = 1, to = 5, by = 1),
  Sites = c("B", "B", "D", "D", "B"),
  Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
  Density = c(0, 1, 0, 1, 2)
)

qw <- c("No", "Yes")

df2 <- data.frame(
  ID = c(rep(1:5, times = 5)),
  Sites = sample(LETTERS[1:4], size = 25, replace = TRUE),
  Choice = sample(qw, size = 25, replace = TRUE)
)

But I am slightly confused about what you need from this reprex. Your statement makes it seem that you want a subsetted df3 that contains only the records where Choice == "Yes" is TRUE and both ID and Sites match between the two dataframes. Obviously this will result in very few records.

library(dplyr)

df3 <- merge(df1, df2, by = c("Sites", "ID")) %>%
  filter(Choice == "Yes") # only one record in my dummy data; the count varies because df2 is sampled at random

If the Choice "Yes" is a property of the ID, but not of the site, here is an option.

df3 <- df1 %>%
  select(-Sites) %>%
  left_join(df2 %>% filter(Choice == "Yes"), by = "ID")

I guess I am just confused about whether both Sites and IDs need to match. If so, there are just not many records here.
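For comparison, here is a minimal sketch of the ID-only join run on the fixed data from the question rather than the random dummy data. It assumes dplyr is installed. Each ID in that data has exactly one "Yes" row, so the result has one row per ID.

```r
library(dplyr)

# Data rebuilt from the question
df1 <- data.frame(stringsAsFactors = FALSE,
                  ID = 1:5,
                  Sites = c("B", "B", "D", "D", "B"),
                  Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
                  Density = c(0, 1, 0, 1, 2))

df2 <- data.frame(stringsAsFactors = FALSE,
                  ID = rep(1:5, each = 4),
                  Sites = rep(c("A", "B", "C", "D"), times = 5),
                  Choice = c("No", "Yes", "No", "No",  "No", "Yes", "No", "No",
                             "No", "No", "No", "Yes",  "No", "No", "No", "Yes",
                             "No", "Yes", "No", "No"))

# Drop Sites from df1, then attach each ID's "Yes" row from df2 by ID alone
df3 <- df1 %>%
  select(-Sites) %>%
  left_join(df2 %>% filter(Choice == "Yes"), by = "ID")

df3
```

Here the Sites column in df3 comes from df2, so it shows the site that was chosen for each ID.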

A data.table solution:

library(data.table)
setDT(df1); setDT(df2)  # convert both data frames to data.tables by reference

df1[df2[ID %in% df2[Choice == "Yes", unique(ID)], ],
    on = .(ID, Sites)][is.na(Density),
                       Density := 0][]

What is in there:

  • df2[Choice == "Yes", unique(ID)] filters the rows where Choice is "Yes" and returns the unique IDs. We need them to filter df2.
  • df2[ID %in% df2[Choice == "Yes", unique(ID)], ] keeps only the rows whose ID has at least one "Yes" in Choice.
  • df1[df_x, on = .(ID, Sites)] joins df1 to whatever is in df_x (in our case, the filtered df2) on ID and Sites. Every row of df_x is kept, and the df1 columns are NA where there is no match.
  • [is.na(Density), Density := 0] selects the rows where Density is NA and assigns 0 to Density in those rows only.
  • [] prints the resulting data to the screen; you don't really need it.
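The steps above can be sketched one at a time. This assumes data.table is installed. The names dt1, dt2, yes_ids, dt2_yes and res are only illustrative, and the data are rebuilt from the question.

```r
library(data.table)

# Data rebuilt from the question, created directly as data.tables
dt1 <- data.table(ID = 1:5,
                  Sites = c("B", "B", "D", "D", "B"),
                  Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
                  Density = c(0, 1, 0, 1, 2))

dt2 <- data.table(ID = rep(1:5, each = 4),
                  Sites = rep(c("A", "B", "C", "D"), times = 5),
                  Choice = c("No", "Yes", "No", "No",  "No", "Yes", "No", "No",
                             "No", "No", "No", "Yes",  "No", "No", "No", "Yes",
                             "No", "Yes", "No", "No"))

# Step 1: IDs that have at least one "Yes"
yes_ids <- dt2[Choice == "Yes", unique(ID)]

# Step 2: keep only the rows of dt2 that belong to those IDs
dt2_yes <- dt2[ID %in% yes_ids]

# Step 3: join on ID and Sites; every row of dt2_yes is kept,
# and Depth/Density are NA where dt1 has no matching row
res <- dt1[dt2_yes, on = .(ID, Sites)]

# Step 4: replace the missing densities with 0, modifying res by reference
res[is.na(Density), Density := 0]

res[]
```

Splitting the chain this way makes it easier to inspect each intermediate table; the one-line version should give the same result.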
