Merge dataframes by a match in at least one of two columns
I've been searching for a solution and experimenting, but I can't seem to perform what should be a simple task.
I have two data frames formatted like the toy examples below:
DF1 = data.frame(A=c("cats","dogs",NA,"dogs"), B=c("kittens","puppies","kittens",NA), C=c(88,99,101,110))
A B C
1 cats kittens 88
2 dogs puppies 99
3 NA kittens 101
4 dogs NA 110
DF2 = data.frame(D=c(1,2), A=c("cats","dogs"), B=c("kittens","puppies"))
D A B
1 1 cats kittens
2 2 dogs puppies
I wish to merge the two data sets such that the output is:
A B C D
1 cats kittens 88 1
2 dogs puppies 99 2
3 dogs NA 110 2
4 NA kittens 101 1
In other words, any row with A=="cats" or B=="kittens" should be mapped to 1 in column D, and any row with A=="dogs" or B=="puppies" should be mapped to 2.
I have used the command
merge(DF1, DF2, by=c("A","B"), all.x=TRUE)
However, this does not match rows 3 and 4 correctly, only rows 1 and 2. I get the output
A B C D
1 cats kittens 88 1
2 dogs puppies 99 2
3 dogs NA 110 NA
4 NA kittens 101 NA
Please note the actual datasets I'm working with are very long. In reality DF1 is over 1,000,000 rows and DF2 is over 300,000 rows, so a solution that scales is what I really need.
Perhaps you can try something along these lines:
temp <- merge(DF1, DF2, by=c("A","B"), all.x=TRUE)
within(temp, {
  M1 <- c("cats", "kittens")
  D <- ifelse(A %in% M1 | B %in% M1, 1, 2)
  rm(M1)
})
# A B C D
# 1 cats kittens 88 1
# 2 dogs puppies 99 2
# 3 dogs <NA> 110 2
# 4 <NA> kittens 101 1
You can nest ifelse statements if you need more than just these two options.
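With many categories, nested ifelse calls quickly become unwieldy. One possible alternative, sketched below, looks D up with match() on each key column in turn and falls back to the second lookup wherever the first found nothing. This is only a minimal sketch, assuming each value of A and each value of B appears at most once in DF2 (match() returns the first hit only); the names d_from_a and d_from_b are made up for illustration.

```r
DF1 <- data.frame(A = c("cats", "dogs", NA, "dogs"),
                  B = c("kittens", "puppies", "kittens", NA),
                  C = c(88, 99, 101, 110))
DF2 <- data.frame(D = c(1, 2),
                  A = c("cats", "dogs"),
                  B = c("kittens", "puppies"))

# Look D up through column A; match() gives NA where A has no partner in DF2.
d_from_a <- DF2$D[match(DF1$A, DF2$A)]
# Look D up through column B in the same way.
d_from_b <- DF2$D[match(DF1$B, DF2$B)]

# Prefer the match on A and fall back to the match on B.
# Assumption: A and B values are unique within DF2.
DF1$D <- ifelse(is.na(d_from_a), d_from_b, d_from_a)
DF1
```

Because match() is vectorized, this avoids the merge step entirely and keeps DF1 in its original row order, which may help with data frames of the size mentioned in the question.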
DF1[which(DF1$A=="cats"|DF1$B=="kittens"), "D"] <- DF2[which(DF2$A=="cats"|DF2$B=="kittens"), "D"]
DF1[which(DF1$A=="dogs"|DF1$B=="puppies"), "D"] <- DF2[which(DF2$A=="dogs"|DF2$B=="puppies"), "D"]
DF1
#-------
A B C D
1 cats kittens 88 1
2 dogs puppies 99 2
3 <NA> kittens 101 1
4 dogs <NA> 110 2
Functionalized:
idxpick <- function(a,b) DF1[which(DF1$A==a|DF1$B==b), "D"] <<- # Yes, I feel guilty.
DF2[which(DF2$A==a|DF2$B==b), "D"]
DF1 = data.frame(A=c("cats","dogs",NA,"dogs"),
B=c("kittens","puppies","kittens",NA),
C=c(88,99,101,110))
DF2 = data.frame(D=c(1,2), A=c("cats","dogs"), B=c("kittens","puppies"))
apply(DF2, 1, function(rr) idxpick(rr["A"], rr["B"]) )
#------------
[1] 1 2
DF1
A B C D
1 cats kittens 88 1
2 dogs puppies 99 2
3 <NA> kittens 101 1
4 dogs <NA> 110 2
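If the global assignment with <<- is a concern, the same idea can be written as a function that takes both data frames and returns a modified copy. The following is a rough sketch under that assumption; assign_d and its arguments are hypothetical names, not part of the answer above.

```r
DF1 <- data.frame(A = c("cats", "dogs", NA, "dogs"),
                  B = c("kittens", "puppies", "kittens", NA),
                  C = c(88, 99, 101, 110))
DF2 <- data.frame(D = c(1, 2),
                  A = c("cats", "dogs"),
                  B = c("kittens", "puppies"))

# Hypothetical helper: for each row of df2, tag every row of df1
# that matches on A or on B with that row's D value.
assign_d <- function(df1, df2) {
  df1$D <- NA_real_
  for (i in seq_len(nrow(df2))) {
    a <- as.character(df2$A[i])
    b <- as.character(df2$B[i])
    # which() drops the NA comparisons, so missing keys never match.
    hit <- which(df1$A == a | df1$B == b)
    df1[hit, "D"] <- df2$D[i]
  }
  df1
}

result <- assign_d(DF1, DF2)
result
```

This loops once per row of DF2, so with hundreds of thousands of lookup rows it is likely to be slow; it is meant to show the logic without side effects rather than to be the fastest option.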
Here's a different approach:
# Reduce() is part of base R, so no extra package needs to be loaded.
partial.merge <- function(DF1, DF2) {
  common.cols <- intersect(names(DF1), names(DF2))
  result.col <- names(DF2)[!(names(DF2) %in% common.cols)]
  # This can only handle one result column:
  stopifnot(length(result.col) == 1)
  # Merge in each common column, one at a time.
  # The identical operation is done for each common column, so Reduce is useful:
  r <- Reduce(function(D, C) merge(D, DF2[c(C, result.col)], by=c(C), all.x=TRUE),
              x=common.cols, init=DF1)
  # The merge created cols like c('D.x', 'D.y'). These are the columns:
  merge.cols <- paste(result.col, c('x', 'y'), sep='.')
  # The .x and .y columns are partial, put them together:
  r[[result.col]] <- rowMeans(r[merge.cols], na.rm=TRUE)
  # Remove the temporaries:
  for (i in merge.cols) {
    r[[i]] <- NULL
  }
  return(r)
}
partial.merge(DF1, DF2)
## B A C D
## 1 kittens cats 88 1
## 2 kittens <NA> 101 1
## 3 puppies dogs 99 2
## 4 <NA> dogs 110 2
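One thing to keep in mind with the rowMeans step: it averages the two partial columns, so it only gives a clean answer when both lookups agree or one of them is NA. The small illustration below uses made-up values for D.x and D.y (not taken from the data above) to show what happens when the lookups disagree or both fail.

```r
# Hypothetical partial results from the two merges.
partial <- data.frame(D.x = c(1, NA, 1, NA),
                      D.y = c(1, 2, 2, NA))

combined <- rowMeans(partial, na.rm = TRUE)
combined
# Row 1: both agree        -> 1
# Row 2: only D.y is known -> 2
# Row 3: conflict (1 vs 2) -> 1.5
# Row 4: neither is known  -> NaN
```

If conflicting matches are possible in the real data, it may be worth checking for non-integer or NaN values in the result before relying on it.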