Using full join to match values in two different dataframes
I have two dataframes. The first dataframe consists of four columns: 1) ID, 2) Sites, 3) Depth, and 4) Density. The second dataframe consists of three columns: 1) ID, 2) Sites, and 3) Choice.
df1
ID Sites Depth Density
1 B 0.2 0
2 B 0.2 1
3 D 0.3 0
4 D 0.3 1
5 B 0.2 2
df2
ID Sites Choice
1 A No
1 B Yes
1 C No
1 D No
2 A No
2 B Yes
2 C No
2 D No
3 A No
3 B No
3 C No
3 D Yes
4 A No
4 B No
4 C No
4 D Yes
5 A No
5 B Yes
5 C No
5 D No
What I am trying to do is add a column to df2 that holds the density for each site when the ID has a "Yes". Below is what I want the output to be:
Desired Output
ID Sites Choice Depth Density
1 A No 0.1 0
1 B Yes 0.2 0
1 C No 0.3 0
1 D No 0.4 0
2 A No 0.1 0
2 B Yes 0.2 1
2 C No 0.3 0
2 D No 0.4 0
3 A No 0.1 0
3 B No 0.2 1
3 C No 0.3 0
3 D Yes 0.4 0
4 A No 0.1 0
4 B No 0.2 1
4 C No 0.3 0
4 D Yes 0.4 1
5 A No 0.1 0
5 B Yes 0.2 2
5 C No 0.3 0
5 D No 0.4 1
I've tried using the following, but it doesn't work:
df3<-df2 %>%
full_join(df1, by = c("ID", "Sites")) %>%
group_by(ID) %>%
mutate(Density = Density[Choice == "Yes"]) %>%
distinct(ID, Sites, .keep_all = TRUE)
Thank you for your help, Stack Overflow community.
Is this what you are looking for?
df1<-data.frame(stringsAsFactors=FALSE,
ID = c(1, 2, 3, 4, 5),
Sites = c("B", "B", "D", "D", "B"),
Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
Density = c(0, 1, 0, 1, 2)
)
df2<-data.frame(stringsAsFactors=FALSE,
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5),
Sites = c("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D", "A", "B", "C", "D"),
Choice = c("No", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "No",
"No", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No",
"No")
)
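library(dplyr)

# Keep every row of df2 and attach Depth and Density where ID and Sites match df1.
# Rows without a match get NA; Density is then set to 0 unless Choice is "Yes".
# Depth stays NA for the unmatched rows.
df3 <- df2 %>%
  left_join(df1, by = c("ID", "Sites")) %>%
  mutate(Density = if_else(Choice == "Yes", Density, 0))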
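For readers without dplyr, the same idea can be sketched in base R. This is only an illustration under the assumption that the data look like the example above; the name df3_base is made up for this sketch and is not part of the original answer.

```r
# Same example data as in the answer above.
df1 <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Sites = c("B", "B", "D", "D", "B"),
  Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
  Density = c(0, 1, 0, 1, 2),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = rep(c(1, 2, 3, 4, 5), each = 4),
  Sites = rep(c("A", "B", "C", "D"), times = 5),
  Choice = c("No", "Yes", "No", "No",
             "No", "Yes", "No", "No",
             "No", "No", "No", "Yes",
             "No", "No", "No", "Yes",
             "No", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of df2 (a left join); unmatched rows get NA.
df3_base <- merge(df2, df1, by = c("ID", "Sites"), all.x = TRUE)

# Keep the joined Density only on "Yes" rows that found a match; otherwise use 0.
df3_base$Density <- ifelse(
  df3_base$Choice == "Yes" & !is.na(df3_base$Density),
  df3_base$Density,
  0
)
```

As with the dplyr version, Depth is left as NA wherever df1 has no row for that ID and site.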
Data.
df1 <- data.frame(
ID = seq(from = 1, to = 5, by = 1),
Sites = c("B", "B", "D", "D", "B"),
Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
Density = c(0,1,0,1,2)
)
qw <- c("No", "Yes")
df2 <- data.frame(
ID = c(rep(1:5, times = 5)),
Sites = sample(LETTERS[1:4], size = 25, replace = TRUE),
Choice = sample(qw, size = 25, replace = TRUE)
)
But I am slightly confused by what you need from this reprex. Your statement makes it seem that you want a subsetted df3 that only contains records where Choice == "Yes" is TRUE and where both ID and Sites match between the two dataframes. Obviously this will result in very few records.
df3 <- merge(df1, df2, by = c("Sites", "ID")) %>%
filter(Choice == "Yes") # only one record from the dummy data.
If the Choice "Yes" is a property of the ID, but not of the site, here is an option.
df3 <- df1 %>%
select(-Sites) %>%
left_join(df2 %>% filter(Choice == "Yes"), by = "ID")
I guess I am just confused about whether both Sites and ID need to match. If so, there are just not many records here.
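To make the difference concrete, here is a rough base R comparison of the two key choices. It uses the fixed example data from the question rather than the random sample above, and the names by_id and by_both are made up for this sketch.

```r
df1 <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Sites = c("B", "B", "D", "D", "B"),
  Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
  Density = c(0, 1, 0, 1, 2),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = rep(c(1, 2, 3, 4, 5), each = 4),
  Sites = rep(c("A", "B", "C", "D"), times = 5),
  Choice = c("No", "Yes", "No", "No",
             "No", "Yes", "No", "No",
             "No", "No", "No", "Yes",
             "No", "No", "No", "Yes",
             "No", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# Match on ID only: drop Sites from df1 so it does not clash with df2's Sites.
# Every df2 row picks up the Depth and Density recorded for its ID.
by_id <- merge(df2, df1[, c("ID", "Depth", "Density")], by = "ID")

# Match on both ID and Sites (inner join): only pairs present in both tables survive.
by_both <- merge(df2, df1, by = c("ID", "Sites"))

nrow(by_id)    # 20
nrow(by_both)  # 5
```

With this data the five rows that survive the two-key join are exactly the "Yes" rows, which is why filtering on Choice afterwards leaves so few records.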
A data.table solution:
library(data.table)
setDT(df1); setDT(df2)  # convert both data frames to data.tables by reference
df1[df2[ID %in% df2[Choice == "Yes", unique(ID)], ],
    on = .(ID, Sites)][is.na(Density),
                       Density := 0][]
df2[Choice == "Yes", unique(ID)]
filters the rows where Choice is "Yes" and returns the unique IDs. We'll need them to filter df2.
df2[ID %in% df2[Choice == "Yes", unique(ID)], ]
keeps only the rows whose ID has at least one "Yes" in Choice.
df1[df_x, on = .(ID, Sites)]
joins df1 to whatever is in df_x (in our case, the filtered df2). Every row of df_x is kept, and the columns of df1 are filled in where ID and Sites match; rows without a match get NA.
[is.na(Density), Density := 0]
filters the rows where Density is NA and assigns 0 to Density only in those rows.
[]
prints the resulting data to the screen; you don't really need it.
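The same chain can be written one step at a time, which may be easier to follow. This is a sketch assuming the data.table package is installed and using the example data from the question; the intermediate names dt1, dt2, yes_ids, df_x, and joined are made up for illustration.

```r
library(data.table)

dt1 <- data.table(
  ID = c(1, 2, 3, 4, 5),
  Sites = c("B", "B", "D", "D", "B"),
  Depth = c(0.2, 0.2, 0.3, 0.3, 0.2),
  Density = c(0, 1, 0, 1, 2)
)
dt2 <- data.table(
  ID = rep(c(1, 2, 3, 4, 5), each = 4),
  Sites = rep(c("A", "B", "C", "D"), times = 5),
  Choice = c("No", "Yes", "No", "No",
             "No", "Yes", "No", "No",
             "No", "No", "No", "Yes",
             "No", "No", "No", "Yes",
             "No", "Yes", "No", "No")
)

# Step 1: unique IDs that have at least one "Yes".
yes_ids <- dt2[Choice == "Yes", unique(ID)]

# Step 2: keep only the dt2 rows belonging to those IDs.
df_x <- dt2[ID %in% yes_ids]

# Step 3: join; every row of df_x is kept, dt1 columns are NA where nothing matches.
joined <- dt1[df_x, on = .(ID, Sites)]

# Step 4: replace the missing densities with 0, modifying joined by reference.
joined[is.na(Density), Density := 0]
```

Because every ID in this example has a "Yes", step 2 keeps all 20 rows; with real data it would drop IDs that never chose a site.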