Using a full join to match values in two different data frames

Question

I have two data frames. The first consists of four columns: 1) ID, 2) Sites, 3) Depth, and 4) Density. The second consists of three columns: 1) ID, 2) Sites, and 3) Choice.

df1

  ID Sites Depth Density
  1     B   0.2       0
  2     B   0.2       1
  3     D   0.3       0
  4     D   0.3       1
  5     B   0.2       2

df2

  ID Sites Choice 
  1     A    No
  1     B    Yes     
  1     C    No
  1     D    No
  2     A    No
  2     B    Yes
  2     C    No
  2     D    No
  3     A    No
  3     B    No
  3     C    No
  3     D    Yes
  4     A    No
  4     B    No
  4     C    No
  4     D    Yes
  5     A    No
  5     B    Yes
  5     C    No
  5     D    No

What I am trying to do is add a column to df2 that holds the density at each site where the ID has a "Yes". Below is what I want the output to be:

Desired Output

  ID Sites Choice Depth  Density
  1     A    No     0.1     0
  1     B    Yes    0.2     0 
  1     C    No     0.3     0 
  1     D    No     0.4     0
  2     A    No     0.1     0
  2     B    Yes    0.2     1
  2     C    No     0.3     0
  2     D    No     0.4     0
  3     A    No     0.1     0
  3     B    No     0.2     1
  3     C    No     0.3     0
  3     D    Yes    0.4     0
  4     A    No     0.1     0
  4     B    No     0.2     1
  4     C    No     0.3     0
  4     D    Yes    0.4     1
  5     A    No     0.1     0
  5     B    Yes    0.2     2
  5     C    No     0.3     0
  5     D    No     0.4     1

I've tried the following, but it doesn't work:

     df3<-df2 %>%
     full_join(df1, by = c("ID", "Sites")) %>%
     group_by(ID) %>%
     mutate(Density = Density[Choice == "Yes"]) %>%
     distinct(ID, Sites, .keep_all = TRUE)

Thank you for your help, Stack Overflow community.

Answer 1

Is this what you are looking for?

df1<-data.frame(stringsAsFactors=FALSE,
                ID = c(1, 2, 3, 4, 5),
                Sites = c("B", "B", "D", "D", "B"),
                Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
                Density = c(0, 1, 0, 1, 2)
)

df2<-data.frame(stringsAsFactors=FALSE,
                ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5),
                Sites = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D",
                          "A", "B", "C", "D", "A", "B", "C", "D"),
                Choice = c("No", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "No",
                           "No", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No",
                           "No")
)

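library(dplyr)

# Keep every row of df2, attach Depth and Density where ID and Sites match,
# then set Density to 0 wherever Choice is not "Yes"
df3 <- df2 %>%
  left_join(df1, by = c("ID", "Sites")) %>%
  mutate(Density = if_else(Choice == "Yes", Density, 0))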
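One thing worth checking with this approach: the left join can only fill Depth and Density for the ID/Sites pairs that exist in df1, so every other row of df2 comes back with NA in Depth. The sketch below rebuilds the question's data (the `rep()` shortcut for Choice is my own shorthand and relies on each ID saying "Yes" only at its df1 site, which holds for the data shown) and counts those NA rows. The desired output in the question lists a Depth for every site, but df1 has no depths for sites A and C, so that part cannot be recovered from the data as posted.

```r
library(dplyr)

# Data as posted in the question
df1 <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Sites = c("B", "B", "D", "D", "B"),
  Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
  Density = c(0, 1, 0, 1, 2),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = rep(c(1, 2, 3, 4, 5), each = 4),
  Sites = rep(c("A", "B", "C", "D"), times = 5),
  stringsAsFactors = FALSE
)
# Shorthand (assumption): "Yes" exactly where the site equals the ID's site in df1
df2$Choice <- ifelse(df2$Sites == rep(df1$Sites, each = 4), "Yes", "No")

# Left join keeps all 20 rows of df2
joined <- left_join(df2, df1, by = c("ID", "Sites"))

# Number of rows with no matching ID/Sites pair in df1 (Depth is NA there)
n_unmatched <- sum(is.na(joined$Depth))

# Density is taken from df1 on "Yes" rows and set to 0 everywhere else
df3 <- mutate(joined, Density = if_else(Choice == "Yes", Density, 0))
```

If you need Depth for every site, it would have to come from a separate lookup table keyed on Sites, which is not part of the question.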

Answer 2

Data:

library(dplyr)

df1 <- data.frame(
  ID = seq(from = 1, to = 5, by = 1),
  Sites = c("B", "B", "D", "D", "B"),
  Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
  Density = c(0, 1, 0, 1, 2)
)

qw <- c("No", "Yes")

# Random dummy data: Sites and Choice differ on every run
df2 <- data.frame(
  ID = rep(1:5, times = 5),
  Sites = sample(LETTERS[1:4], size = 25, replace = TRUE),
  Choice = sample(qw, size = 25, replace = TRUE)
)

But I am slightly confused about what you need from this reprex. Your statement makes it seem that you want a subsetted df3 that contains only the records where Choice == "Yes" is TRUE and where both ID and Sites match between the two data frames. Obviously this will result in very few records.

df3 <- merge(df1, df2, by = c("Sites", "ID")) %>%
  filter(Choice == "Yes") # only one record from the dummy data.

If the Choice "Yes" is a property of the ID but not of the site, here is an option:

df3 <- df1 %>%
  select(-Sites) %>%
  left_join(df2 %>% filter(Choice == "Yes"), by = "ID")

I guess I am just confused about whether both Sites and ID need to match. If so, there are just not many records here.
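To make the two readings concrete, here is a small sketch that uses the question's own df2 instead of the random dummy data, and base R `merge()` only. The names `yes`, `both_keys`, and `id_only` are mine. With the data as posted, each ID says "Yes" only at the site recorded for it in df1, so both readings end up with the same five rows.

```r
# Data as posted in the question
df1 <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Sites = c("B", "B", "D", "D", "B"),
  Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
  Density = c(0, 1, 0, 1, 2),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = rep(c(1, 2, 3, 4, 5), each = 4),
  Sites = rep(c("A", "B", "C", "D"), times = 5),
  Choice = c("No", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "No",
             "No", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# Only the rows where a site was chosen
yes <- df2[df2$Choice == "Yes", ]

# Reading 1: both ID and Sites must match
both_keys <- merge(df1, yes, by = c("ID", "Sites"))

# Reading 2: only ID must match; Sites is dropped from df1 first
# so the result has a single Sites column (the one from df2)
id_only <- merge(df1[, c("ID", "Depth", "Density")], yes, by = "ID")
```

The two would diverge as soon as an ID had a "Yes" at a site other than the one listed for it in df1.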

Answer 3

A data.table solution:

library(data.table)
setDT(df1); setDT(df2)  # convert both data frames to data.tables by reference
df1[df2[ID %in% df2[Choice == "Yes", unique(ID)], ],
    on = .(ID, Sites)][is.na(Density),
                       Density := 0][]

What is in there:

df2[Choice == "Yes", unique(ID)] filters the rows where Choice is "Yes" and returns the unique IDs. We need them to filter df2.
df2[ID %in% df2[Choice == "Yes", unique(ID)], ] keeps only the rows whose ID has at least one "Yes" in Choice.
df1[df_x, on = .(ID, Sites)] joins df1 onto whatever is in df_x (in our case, the filtered df2), keeping every row of df_x and looking up the matching df1 columns.
[is.na(Density), Density := 0] selects the rows where Density is NA and assigns 0 to Density in those rows only.
[] prints the resulting data to the screen; you don't really need it.
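The steps above can also be written out one at a time, which makes the intermediate results easier to inspect. This is only a sketch on the question's data: the names `ids_with_yes`, `df_x`, and `joined` are mine, and the one-line construction of Choice is a shorthand that relies on each ID saying "Yes" only at its df1 site.

```r
library(data.table)

# Data as posted in the question, built directly as data.tables
df1 <- data.table(
  ID = c(1, 2, 3, 4, 5),
  Sites = c("B", "B", "D", "D", "B"),
  Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
  Density = c(0, 1, 0, 1, 2)
)
df2 <- data.table(
  ID = rep(c(1, 2, 3, 4, 5), each = 4),
  Sites = rep(c("A", "B", "C", "D"), times = 5)
)
# Shorthand (assumption): "Yes" exactly where the site equals the ID's site in df1
df2[, Choice := ifelse(Sites == rep(df1$Sites, each = 4), "Yes", "No")]

# Step 1: unique IDs that have at least one "Yes"
ids_with_yes <- df2[Choice == "Yes", unique(ID)]

# Step 2: keep only the df2 rows belonging to those IDs
df_x <- df2[ID %in% ids_with_yes]

# Step 3: look up df1 for every row of df_x; unmatched rows get NA
joined <- df1[df_x, on = .(ID, Sites)]

# Step 4: replace the NA densities with 0, modifying joined by reference
joined[is.na(Density), Density := 0]
```

Note that Depth is still NA for the rows that had no match in df1; only Density is filled in.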
