What is the cleanest and most efficient way to join two tables (data frames) in R?

Question

I'm trying to find the most efficient way of joining data from one data frame into another. I have a master data set (df) and a secondary data set (lookup), and I want to append the data in the lookup table to the master data set.

Theoretical data as follows:

COLUMN_A <- 1:5
COLUMN_B <- 1:5
LOOKUP_COL <- letters[1:5]


df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL) 

  COLUMN_A COLUMN_B LOOKUP_COL
1        1        1          a
2        2        2          b
3        3        3          c
4        4        4          d
5        5        5          e

COLUMN_A <- 2*(1:5)
LOOKUP_COL <- letters[1:5]
SPARE_COL <- runif(5)

lookup <- data.frame(COLUMN_A,LOOKUP_COL,SPARE_COL) 

  COLUMN_A LOOKUP_COL SPARE_COL
1        2          a 0.6113499
2        4          b 0.3712987
3        6          c 0.3551038
4        8          d 0.6650248
5       10          e 0.2680611

This is how I've been doing it so far:

results <- merge(df,lookup,by='LOOKUP_COL')

Which provides me with:

  LOOKUP_COL COLUMN_A.x COLUMN_B COLUMN_A.y SPARE_COL
1          a          1        1          2 0.6113499
2          b          2        2          4 0.3712987
3          c          3        3          6 0.3551038
4          d          4        4          8 0.6650248
5          e          5        5         10 0.2680611

So the entire lookup table has been merged into the master data, but SPARE_COL is surplus to requirements. How can I control which columns get passed into the master data? Essentially, I'm trying to understand how the functionality of an Excel VLOOKUP can be reproduced in R.

Thanks

Answer 1

EDIT: This one keeps SPARE_COL instead of COLUMN_A. If you have columns with the same name in different data frames, the solution with indices requires that you rename them in one of the data frames before merging everything together.

Single column

You can do this by passing only the columns you want to merge to the function merge. Obviously, you have to keep the columns used for merging in your selection. Taking your example, this becomes:

keep <- c('LOOKUP_COL','SPARE_COL')
results <- merge(df,lookup[keep],by='LOOKUP_COL')

And the result is:

> results
  LOOKUP_COL COLUMN_A COLUMN_B  SPARE_COL
1          a        1        1 0.75670441
2          b        2        2 0.52122950
3          c        3        3 0.99338019
4          d        4        4 0.71904088
5          e        5        5 0.05405722

By selecting the columns first, you make merge work faster, and you don't have to bother about finding the columns you want after the merge.
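
One caveat worth illustrating: by default merge() performs an inner join, so rows of df whose key has no match in lookup are dropped, whereas an Excel VLOOKUP would keep the row and report a missing value. The sketch below reuses the question's toy data with one extra row keyed "z" and fixed SPARE_COL values instead of runif(); both are assumptions added for illustration.

```r
# Toy data mirroring the question; the "z" row and the fixed SPARE_COL
# values are assumptions made for this illustration.
df <- data.frame(COLUMN_A = 1:6, COLUMN_B = 1:6,
                 LOOKUP_COL = c(letters[1:5], "z"))
lookup <- data.frame(COLUMN_A = 2 * (1:5), LOOKUP_COL = letters[1:5],
                     SPARE_COL = c(0.1, 0.2, 0.3, 0.4, 0.5))

keep <- c("LOOKUP_COL", "SPARE_COL")

# Default (inner join): the unmatched "z" row disappears
inner <- merge(df, lookup[keep], by = "LOOKUP_COL")

# all.x = TRUE (left join): every row of df is kept, SPARE_COL is NA for "z"
left <- merge(df, lookup[keep], by = "LOOKUP_COL", all.x = TRUE)

nrow(inner)  # 5
nrow(left)   # 6
```

If every key in df is guaranteed to exist in lookup, the two calls return the same rows and the default is fine.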

If speed is an issue and the merge is simple, you can speed things up by doing the merge manually using indices:

id <- match(df$LOOKUP_COL, lookup$LOOKUP_COL)
keep <- c('SPARE_COL')
results <- df
results[keep] <- lookup[id,keep, drop = FALSE]

This gives the same values and a good speedup. Note that, unlike merge(), it keeps the row order of df, and rows without a match get NA instead of being dropped.
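
As a quick sanity check, the two approaches can be compared on the question's toy data. This is only a sketch: set.seed() is added here so the run is reproducible, and the variable names merged and indexed are made up for the comparison. Keep in mind that match() returns only the first matching position, which is the same behaviour as VLOOKUP, so the index approach assumes the key is unique in lookup.

```r
set.seed(1)  # added for reproducibility; not part of the original example
df <- data.frame(COLUMN_A = 1:5, COLUMN_B = 1:5, LOOKUP_COL = letters[1:5])
lookup <- data.frame(COLUMN_A = 2 * (1:5), LOOKUP_COL = letters[1:5],
                     SPARE_COL = runif(5))

# Approach 1: merge() on a column subset
merged <- merge(df, lookup[c("LOOKUP_COL", "SPARE_COL")], by = "LOOKUP_COL")

# Approach 2: manual lookup via match()
id <- match(df$LOOKUP_COL, lookup$LOOKUP_COL)
indexed <- df
indexed["SPARE_COL"] <- lookup[id, "SPARE_COL", drop = FALSE]

# The looked-up values agree; only the column order of the two results differs
identical(merged$SPARE_COL, indexed$SPARE_COL)
names(merged)   # key column first
names(indexed)  # original df columns, then SPARE_COL
```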

More columns

Let's create an example with 2 lookup columns first:

N <- 10000

COLUMN_A <- 1:N
COLUMN_B <- 1:N
LOOKUP_COL <- sample(letters[3:7], N, replace = TRUE)
LOOKUP_2 <- sample(letters[10:14], N, replace = TRUE)

df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL, LOOKUP_2) 

COLUMN_A <- 2*(1:36)
LOOKUP_COL <- rep(letters[1:6], each = 6)
LOOKUP_2 <- rep(letters[10:15], times = 6)
SPARE_COL <- runif(36)

lookup <- data.frame(COLUMN_A,LOOKUP_COL, LOOKUP_2, SPARE_COL)

You can use merge again like this:

keep <- c('LOOKUP_COL','SPARE_COL', 'LOOKUP_2')
results <- merge(df,lookup[keep],by=c('LOOKUP_COL', 'LOOKUP_2'))

And you can use indices again. Before you match, you have to create the interaction between the lookup columns. You can do this with the function interaction() for any number of lookup columns:

  lookups <- c('LOOKUP_COL','LOOKUP_2')
  id <- match(interaction(df[lookups]), 
              interaction(lookup[lookups]))
  keep <- c('SPARE_COL')
  results <- df
  results[keep] <- lookup[id,keep, drop = FALSE]
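
To see what the composite key does, here is a small sketch on made-up data (the ID column and all values are assumptions for illustration). It also shows where the two approaches differ: a key combination that is absent from lookup yields NA with match(), while merge() drops the row. One thing to watch for is that interaction() joins the keys with "." by default, so keys that themselves contain dots could in principle collide.

```r
# Made-up data: the last row of df ("g", "j") has no counterpart in lookup
df <- data.frame(ID = 1:4,
                 LOOKUP_COL = c("a", "a", "b", "g"),
                 LOOKUP_2   = c("j", "k", "j", "j"))
lookup <- data.frame(LOOKUP_COL = c("a", "a", "b", "b"),
                     LOOKUP_2   = c("j", "k", "j", "k"),
                     SPARE_COL  = c(10, 20, 30, 40))

lookups <- c("LOOKUP_COL", "LOOKUP_2")

# interaction() builds one key per row ("a.j", "a.k", ...), match() compares them
id <- match(interaction(df[lookups]), interaction(lookup[lookups]))
id  # 1 2 3 NA

results <- df
results["SPARE_COL"] <- lookup[id, "SPARE_COL", drop = FALSE]
results$SPARE_COL  # 10 20 30 NA

# merge() on the same keys keeps only the three matching rows
nrow(merge(df, lookup, by = lookups))  # 3
```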

Timing

In the test below, the speedup is about 6-fold for the two-column case:

     test replications elapsed relative user.self sys.self user.child sys.child
1 code1()          100    6.30    6.117      6.30        0         NA        NA
2 code2()          100    1.03    1.000      1.03        0         NA        NA

The code for testing:

N <- 10000

COLUMN_A <- 1:N
COLUMN_B <- 1:N
LOOKUP_COL <- sample(letters[3:7], N, replace = TRUE)
LOOKUP_2 <- sample(letters[10:14], N, replace = TRUE)


df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL, LOOKUP_2) 

COLUMN_A <- 2*(1:36)
LOOKUP_COL <- rep(letters[1:6], each = 6)
LOOKUP_2 <- rep(letters[10:15], times = 6)
SPARE_COL <- runif(36)

lookup <- data.frame(COLUMN_A,LOOKUP_COL, LOOKUP_2, SPARE_COL) 

code1 <- function(){
  keep <- c('LOOKUP_COL','SPARE_COL', 'LOOKUP_2')
  results <- merge(df,lookup[keep],by=c('LOOKUP_COL', 'LOOKUP_2'))
}

code2 <- function(){
  lookups <- c('LOOKUP_COL','LOOKUP_2')
  id <- match(interaction(df[lookups]), 
              interaction(lookup[lookups]))
  keep <- c('SPARE_COL')
  results <- df
  results[keep] <- lookup[id,keep, drop = FALSE]
}

library(rbenchmark)  # stops with an error if the package is not installed

benchmark(code1(),code2())
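
If rbenchmark is not available, a rough comparison can be sketched with base R's system.time(). The repetition count of 20, the seed, and the names time_merge, time_match and elapsed are arbitrary choices for this illustration; absolute timings depend on the machine and R version, so no particular ratio should be expected.

```r
set.seed(42)  # arbitrary seed, added for reproducibility
N <- 10000
df <- data.frame(COLUMN_A = 1:N, COLUMN_B = 1:N,
                 LOOKUP_COL = sample(letters[3:7], N, replace = TRUE),
                 LOOKUP_2 = sample(letters[10:14], N, replace = TRUE))
lookup <- data.frame(COLUMN_A = 2 * (1:36),
                     LOOKUP_COL = rep(letters[1:6], each = 6),
                     LOOKUP_2 = rep(letters[10:15], times = 6),
                     SPARE_COL = runif(36))
lookups <- c("LOOKUP_COL", "LOOKUP_2")

# Time 20 repetitions of the merge() approach
time_merge <- system.time(for (i in 1:20) {
  res1 <- merge(df, lookup[c(lookups, "SPARE_COL")], by = lookups)
})

# Time 20 repetitions of the match()/interaction() approach
time_match <- system.time(for (i in 1:20) {
  id <- match(interaction(df[lookups]), interaction(lookup[lookups]))
  res2 <- df
  res2["SPARE_COL"] <- lookup[id, "SPARE_COL", drop = FALSE]
})

# Elapsed wall-clock seconds for each approach
elapsed <- c(merge = time_merge[["elapsed"]], match = time_match[["elapsed"]])
elapsed
```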

Answer 2

For manipulating and merging data frames, I suggest the dplyr package:

library(dplyr)
df %>%
  left_join(lookup, by=c("LOOKUP_COL")) %>%
  select(LOOKUP_COL, COLUMN_A=COLUMN_A.x, COLUMN_B, COLUMN_C=COLUMN_A.y)
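
To answer the question as asked (keep SPARE_COL and leave out the lookup table's COLUMN_A), one possible variant is to select the wanted columns before joining, so no .x/.y suffixes appear. This is a sketch on the question's toy data with fixed SPARE_COL values (an assumption for reproducibility) and it requires dplyr to be installed.

```r
library(dplyr)

df <- data.frame(COLUMN_A = 1:5, COLUMN_B = 1:5, LOOKUP_COL = letters[1:5])
lookup <- data.frame(COLUMN_A = 2 * (1:5), LOOKUP_COL = letters[1:5],
                     SPARE_COL = c(0.1, 0.2, 0.3, 0.4, 0.5))

# Keep only the key and the column to bring over, then left-join onto df
results <- df %>%
  left_join(select(lookup, LOOKUP_COL, SPARE_COL), by = "LOOKUP_COL")

names(results)  # "COLUMN_A" "COLUMN_B" "LOOKUP_COL" "SPARE_COL"
```

left_join() keeps every row of df, so unmatched keys would show up as NA in SPARE_COL rather than being dropped.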

2 answers

Answer 1
1, accepted, 2017-01-24 13:52:53

Answer 2
0, 2017-01-24 16:50:10
