What is the cleanest and most efficient way of joining two tables (dataframes) in R?

Question

I'm trying to find the most efficient way of joining data from one dataframe into another. I have a master data set (df) and a secondary data set (lookup), and I want to append the data in the lookup table to the master data set.

Theoretical data as follows:

COLUMN_A <- 1:5
COLUMN_B <- 1:5
LOOKUP_COL <- letters[1:5]


df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL) 

  COLUMN_A COLUMN_B LOOKUP_COL
1        1        1          a
2        2        2          b
3        3        3          c
4        4        4          d
5        5        5          e

COLUMN_A <- 1:5
LOOKUP_COL <- letters[1:5]
SPARE_COL <- runif(5)

lookup <- data.frame(COLUMN_A,LOOKUP_COL,SPARE_COL) 

  COLUMN_A LOOKUP_COL SPARE_COL
1        1          a 0.6113499
2        2          b 0.3712987
3        3          c 0.3551038
4        4          d 0.6650248
5        5          e 0.2680611

This is how I've been doing it so far:

results <- merge(df,lookup,by='LOOKUP_COL')

Which provides me with:

  LOOKUP_COL COLUMN_A.x COLUMN_B COLUMN_A.y SPARE_COL
1          a          1        1          1 0.6113499
2          b          2        2          2 0.3712987
3          c          3        3          3 0.3551038
4          d          4        4          4 0.6650248
5          e          5        5          5 0.2680611

So the entire lookup table has been merged into the master data, even though SPARE_COL is surplus to requirements. How can I control which columns get passed into the master data? Essentially, I'm trying to understand how the functionality of an Excel VLOOKUP can be reproduced in R.

Thanks

Answer 1

EDIT: This answer keeps SPARE_COL rather than COLUMN_A. If columns share a name across the two data frames, the index-based solution requires that you rename them in one of the data frames before combining everything.

Single column

You can do this by passing only the columns you want to merge to merge(). You do have to keep the columns used for merging in your selection. For your example, this becomes:

keep <- c('LOOKUP_COL','SPARE_COL')
results <- merge(df,lookup[keep],by='LOOKUP_COL')

And the result is

> results
  LOOKUP_COL COLUMN_A COLUMN_B  SPARE_COL
1          a        1        1 0.75670441
2          b        2        2 0.52122950
3          c        3        3 0.99338019
4          d        4        4 0.71904088
5          e        5        5 0.05405722

By selecting the columns first, you make merge work faster and you don't have to bother about finding the columns you want after the merge.
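One thing to watch when using merge() as a VLOOKUP replacement: by default it keeps only the rows whose key is found in both data frames. Here is a minimal sketch of how all.x = TRUE keeps every row of the master table. The toy data, the unmatched key "z" and OTHER_COL are made up for illustration and are not from the question:

```r
# Toy data: the key "z" has no match in the lookup table (illustrative assumption).
df <- data.frame(COLUMN_A = 1:3,
                 LOOKUP_COL = c("a", "b", "z"),
                 stringsAsFactors = FALSE)
lookup <- data.frame(LOOKUP_COL = c("a", "b", "c"),
                     SPARE_COL = c(0.1, 0.2, 0.3),
                     OTHER_COL = c("x", "y", "w"),  # hypothetical column we do not want
                     stringsAsFactors = FALSE)

keep <- c("LOOKUP_COL", "SPARE_COL")

# Default merge is an inner join: the "z" row is dropped.
inner <- merge(df, lookup[keep], by = "LOOKUP_COL")

# all.x = TRUE keeps every row of df and fills unmatched lookups with NA.
left <- merge(df, lookup[keep], by = "LOOKUP_COL", all.x = TRUE)

print(inner)
print(left)
```

If every key in the master data is guaranteed to exist in the lookup table, the two calls return the same rows and the extra argument is harmless.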

If speed is an issue and the merge is simple, you can speed things up by manually doing the merge using indices:

id <- match(df$LOOKUP_COL, lookup$LOOKUP_COL)
keep <- c('SPARE_COL')
results <- df
results[keep] <- lookup[id,keep, drop = FALSE]

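This gives the same values as merge, provided every key in df has exactly one match in lookup, and it is noticeably faster (see Timing below). Unlike merge, it keeps the rows of df in their original order and appends the new column at the end.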
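To make the behaviour of the index approach concrete, here is a small sketch on made-up data (the keys and values are illustrative, not from the question). It shows that match() returns the position of the first match for each key, and NA where there is none, so unmatched rows stay in the result with a missing value:

```r
# Toy data: "z" is missing from lookup and "a" appears twice in df (illustrative).
df <- data.frame(COLUMN_A = 1:4,
                 LOOKUP_COL = c("c", "a", "z", "a"),
                 stringsAsFactors = FALSE)
lookup <- data.frame(LOOKUP_COL = c("a", "b", "c"),
                     SPARE_COL = c(0.1, 0.2, 0.3),
                     stringsAsFactors = FALSE)

# For each key in df: row number of its first match in lookup, or NA if absent.
id <- match(df$LOOKUP_COL, lookup$LOOKUP_COL)

# Pull the wanted column by row index; NA indices give NA values.
results <- df
results["SPARE_COL"] <- lookup[id, "SPARE_COL", drop = FALSE]

print(id)
print(results)
```

Because only the first match is used, this is closest to a VLOOKUP with exact matching. If lookup can contain duplicate keys, merge will return one row per match while this approach will not.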

More columns

Let's create an example with 2 lookup columns first:

N <- 10000

COLUMN_A <- 1:N
COLUMN_B <- 1:N
LOOKUP_COL <- sample(letters[3:7], N, replace = TRUE)
LOOKUP_2 <- sample(letters[10:14], N, replace = TRUE)

df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL, LOOKUP_2) 

COLUMN_A <- 2*(1:36)
LOOKUP_COL <- rep(letters[1:6], each = 6)
LOOKUP_2 <- rep(letters[10:15], times = 6)
SPARE_COL <- runif(36)

lookup <- data.frame(COLUMN_A,LOOKUP_COL, LOOKUP_2, SPARE_COL)

You can use merge again like this:

keep <- c('LOOKUP_COL','SPARE_COL', 'LOOKUP_2')
results <- merge(df,lookup[keep],by=c('LOOKUP_COL', 'LOOKUP_2'))

And you can use indices again. Before you match, you have to create the interaction between the lookup columns. You can do this using the function interaction() for any number of lookup columns:

  lookups <- c('LOOKUP_COL','LOOKUP_2')
  id <- match(interaction(df[lookups]), 
              interaction(lookup[lookups]))
  keep <- c('SPARE_COL')
  results <- df
  results[keep] <- lookup[id,keep, drop = FALSE]
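As a sanity check, the sketch below compares the two approaches on a small made-up data set in which every key combination exists in the lookup table. The seed, sizes and SPARE_COL values are arbitrary choices for illustration:

```r
set.seed(1)  # arbitrary seed, for reproducibility only

df <- data.frame(COLUMN_A = 1:20,
                 LOOKUP_COL = sample(letters[1:3], 20, replace = TRUE),
                 LOOKUP_2 = sample(letters[10:12], 20, replace = TRUE),
                 stringsAsFactors = FALSE)

# Lookup table covering every combination of the two keys.
lookup <- expand.grid(LOOKUP_COL = letters[1:3],
                      LOOKUP_2 = letters[10:12],
                      stringsAsFactors = FALSE)
lookup$SPARE_COL <- seq_len(nrow(lookup)) / 10

lookups <- c("LOOKUP_COL", "LOOKUP_2")

# Index approach: combine the key columns, then match on the combined key.
id <- match(interaction(df[lookups]), interaction(lookup[lookups]))
by_index <- df
by_index["SPARE_COL"] <- lookup[id, "SPARE_COL", drop = FALSE]

# merge approach on the same keys.
by_merge <- merge(df, lookup, by = lookups)

# merge() reorders rows and columns, so align both before comparing.
by_index <- by_index[order(by_index$COLUMN_A), ]
by_merge <- by_merge[order(by_merge$COLUMN_A), names(by_index)]
same <- isTRUE(all.equal(by_index$SPARE_COL, by_merge$SPARE_COL))
print(same)
```

The comparison only holds under the stated assumption of a complete, duplicate-free lookup table. With missing combinations, merge drops rows by default and the index approach returns NA instead.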

Timing

In the test below the speedup is about 6-fold for the two-column case:

     test replications elapsed relative user.self sys.self user.child sys.child
1 code1()          100    6.30    6.117      6.30        0         NA        NA
2 code2()          100    1.03    1.000      1.03        0         NA        NA

The code for testing:

N <- 10000

COLUMN_A <- 1:N
COLUMN_B <- 1:N
LOOKUP_COL <- sample(letters[3:7], N, replace = TRUE)
LOOKUP_2 <- sample(letters[10:14], N, replace = TRUE)


df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL, LOOKUP_2) 

COLUMN_A <- 2*(1:36)
LOOKUP_COL <- rep(letters[1:6], each = 6)
LOOKUP_2 <- rep(letters[10:15], times = 6)
SPARE_COL <- runif(36)

lookup <- data.frame(COLUMN_A,LOOKUP_COL, LOOKUP_2, SPARE_COL) 

code1 <- function(){
  keep <- c('LOOKUP_COL','SPARE_COL', 'LOOKUP_2')
  results <- merge(df,lookup[keep],by=c('LOOKUP_COL', 'LOOKUP_2'))
}

code2 <- function(){
  lookups <- c('LOOKUP_COL','LOOKUP_2')
  id <- match(interaction(df[lookups]), 
              interaction(lookup[lookups]))
  keep <- c('SPARE_COL')
  results <- df
  results[keep] <- lookup[id,keep, drop = FALSE]
}

library(rbenchmark)

benchmark(code1(),code2())
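If rbenchmark is not installed, a rough comparison can be made with base R's system.time(). This is only a sketch: the seed, the number of repetitions (reps) and the explicit return of results are my own choices, and the timings will differ from machine to machine, so no expected output is shown.

```r
set.seed(42)  # arbitrary seed so the sampled keys are reproducible
N <- 10000

df <- data.frame(COLUMN_A = 1:N,
                 COLUMN_B = 1:N,
                 LOOKUP_COL = sample(letters[3:7], N, replace = TRUE),
                 LOOKUP_2 = sample(letters[10:14], N, replace = TRUE))

lookup <- data.frame(COLUMN_A = 2 * (1:36),
                     LOOKUP_COL = rep(letters[1:6], each = 6),
                     LOOKUP_2 = rep(letters[10:15], times = 6),
                     SPARE_COL = runif(36))

# merge-based lookup
code1 <- function() {
  keep <- c("LOOKUP_COL", "SPARE_COL", "LOOKUP_2")
  merge(df, lookup[keep], by = c("LOOKUP_COL", "LOOKUP_2"))
}

# index-based lookup
code2 <- function() {
  lookups <- c("LOOKUP_COL", "LOOKUP_2")
  id <- match(interaction(df[lookups]), interaction(lookup[lookups]))
  results <- df
  results["SPARE_COL"] <- lookup[id, "SPARE_COL", drop = FALSE]
  results
}

reps <- 20  # fewer repetitions than the benchmark above, to keep this quick
time_merge <- system.time(for (i in seq_len(reps)) code1())
time_index <- system.time(for (i in seq_len(reps)) code2())

# Elapsed seconds for each approach.
print(c(merge = time_merge[["elapsed"]], index = time_index[["elapsed"]]))
```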

Answer 2

For manipulating and merging dataframes, I suggest the dplyr package:

library(dplyr)
df %>%
  left_join(lookup, by=c("LOOKUP_COL")) %>%
  select(LOOKUP_COL, COLUMN_A=COLUMN_A.x, COLUMN_B, COLUMN_C=COLUMN_A.y)

solution1
1 ACCEPTED 2017-01-24 13:52:53

solution2
0 2017-01-24 16:50:10
