Merge dataframes by a match in at least one of two columns

Question

I've been searching for a solution and have been experimenting, but I can't seem to perform what I should be a simple task.

I have two data frames formatted similar to the below toy examples

DF1 = data.frame(A=c("cats","dogs",NA,"dogs"), B=c("kittens","puppies","kittens",NA), C=c(88,99,101,110))

    A       B           C
1   cats    kittens     88
2   dogs    puppies     99
3   NA      kittens     101
4   dogs    NA          110


DF2 = data.frame(D=c(1,2), A=c("cats","dogs"), B=c("kittens","puppies"))

    D   A       B
1   1   cats    kittens
2   2   dogs    puppies

I wish to merge the two data sets such that the output is:

      A     B         C     D
1   cats    kittens   88    1
2   dogs    puppies   99    2
3   dogs    NA        110   2
4     NA    kittens   101   1

In other words, any rows with labels A=="cats" or B=="kittens" will be mapped to 1 in the column D, any rows with A=="dogs" or B=="puppies" will be mapped to 2.

I have used the command

merge(DF1, DF2, by=c("A","B"), all.x=TRUE)

However this not match rows 3 and 4 correctly, only rows 1 and 2. I get the output

      A     B         C     D
1   cats    kittens   88    1
2   dogs    puppies   99    2
3   dogs    NA        110   NA
4     NA    kittens   101   NA

Please note the actual datasets I'm working with are very long. In reality DF1 is over 1,000,000 rows and DF2 is over 300,000 rows thousands of rows each, so a solution that could be scaled is what I really need.

Answer 1

Perhaps you can try something along these lines:

temp <- merge(DF1, DF2, by=c("A","B"), all.x=TRUE)

within(temp, {
  M1 <- c("cats", "kittens")
  D <- ifelse(A %in% M1 | B %in% M1, 1, 2)
  rm(M1)
})
#      A       B   C D
# 1 cats kittens  88 1
# 2 dogs puppies  99 2
# 3 dogs    <NA> 110 2
# 4 <NA> kittens 101 1

You can nest ifelse statements if you need more than just these two options.

Answer 2

DF1[which(DF1$A=="cats"|DF1$B=="kittens"), "D"] <- DF2[which(DF2$A=="cats"|DF2$B=="kittens"), "D"]
DF1[which(DF1$A=="dogs"|DF1$B=="puppies"), "D"] <- DF2[which(DF2$A=="dogs"|DF2$B=="puppies"), "D"]
DF1
#-------
     A       B   C D
1 cats kittens  88 1
2 dogs puppies  99 2
3 <NA> kittens 101 1
4 dogs    <NA> 110 2

Functionalized:

idxpick <- function(a,b) DF1[which(DF1$A==a|DF1$B==b), "D"] <<- # Yes, I feel guilty.
                                   DF2[which(DF2$A==a|DF2$B==b), "D"]
DF1 = data.frame(A=c("cats","dogs",NA,"dogs"), 
                 B=c("kittens","puppies","kittens",NA), 
                 C=c(88,99,101,110))
DF2 = data.frame(D=c(1,2), A=c("cats","dogs"), B=c("kittens","puppies"))
apply(DF2, 1, function(rr) idxpick(rr["A"], rr["B"]) )
#------------
[1] 1 2

DF1
     A       B   C D
1 cats kittens  88 1
2 dogs puppies  99 2
3 <NA> kittens 101 1
4 dogs    <NA> 110 2

Answer 3

Here's a different approach:

library(functional)

partial.merge <- function(DF1, DF2) {
  common.cols <- intersect(names(DF1), names(DF2))
  result.col <- names(DF2)[!(names(DF2) %in% common.cols)]

  # This can only handle one result column:
  stopifnot(length(result.col) == 1)

  # Merge in each common column, one at a time.
  # The identical operation is done for each common column, so Reduce is useful:
  r <- Reduce(function(D, C) merge(D, DF2[c(C, result.col)], by=c(C), all.x=TRUE), x=common.cols, init=DF1)

  # The merge created cols like c('D.x', 'D.y').  These are the columns:
  merge.cols <- paste(result.col, c('x', 'y'), sep='.')

  # The .x and .y columns are partial, put them together:
  r[[result.col]] <- rowMeans(r[merge.cols], na.rm=TRUE)

  # Remove the temporaries:
  for (i in merge.cols) {
    r[[i]] <- NULL
  }
  return(r)
}

partial.merge(DF1, DF2)
##         B    A   C D
## 1 kittens cats  88 1
## 2 kittens <NA> 101 1
## 3 puppies dogs  99 2
## 4    <NA> dogs 110 2

Merge dataframes by a match in at least one of two columns

Question

3 answers

solution1
3 2013-04-30 06:55:09

solution2
2 2013-04-30 07:10:01

solution3
2 ACCPTED 2013-04-30 08:35:57

Merge dataframes by a match in at least one of two columns

Question

3 answers

solution1 3 2013-04-30 06:55:09

solution2 2 2013-04-30 07:10:01

solution3 2 ACCPTED 2013-04-30 08:35:57

solution1
3 2013-04-30 06:55:09

solution2
2 2013-04-30 07:10:01

solution3
2 ACCPTED 2013-04-30 08:35:57