What is the cleanest and most efficient way of joining two tables (data frames) in R?
I'm trying to find the most efficient way of joining data from one data frame into another. The idea is that I have a master data set (df) and a secondary data set (lookup). I want to append the data in the lookup table to the master data set.
Theoretical data as follows:
COLUMN_A <- 1:5
COLUMN_B <- 1:5
LOOKUP_COL <- letters[1:5]
df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL)
  COLUMN_A COLUMN_B LOOKUP_COL
1        1        1          a
2        2        2          b
3        3        3          c
4        4        4          d
5        5        5          e
COLUMN_A <- 2*(1:5)
LOOKUP_COL <- letters[1:5]
SPARE_COL <- runif(5)
lookup <- data.frame(COLUMN_A,LOOKUP_COL,SPARE_COL)
  COLUMN_A LOOKUP_COL SPARE_COL
1        2          a 0.6113499
2        4          b 0.3712987
3        6          c 0.3551038
4        8          d 0.6650248
5       10          e 0.2680611
This is how I've been doing it so far:
results <- merge(df, lookup, by = 'LOOKUP_COL')
Which provides me with:
  LOOKUP_COL COLUMN_A.x COLUMN_B COLUMN_A.y SPARE_COL
1          a          1        1          2 0.6113499
2          b          2        2          4 0.3712987
3          c          3        3          6 0.3551038
4          d          4        4          8 0.6650248
5          e          5        5         10 0.2680611
So it seems that the entire lookup table has been merged into the master data, and SPARE_COL is surplus to requirements. How can I control which columns get passed into the master data? Essentially, I'm trying to understand how the functionality of an Excel VLOOKUP can be reproduced in R.
Thanks
EDIT: This one uses SPARE_COL as the column to keep instead of COLUMN_A. If you have columns with the same name in different data frames, the solution with indices will require that you rename them in one of the data frames before merging everything together.
You can do this by passing only the columns you want to merge to the function merge. Obviously you have to keep the columns used for merging in your selection. Taking your example, this becomes:
keep <- c('LOOKUP_COL','SPARE_COL')
results <- merge(df,lookup[keep],by='LOOKUP_COL')
And the result is:
> results
  LOOKUP_COL COLUMN_A COLUMN_B  SPARE_COL
1          a        1        1 0.75670441
2          b        2        2 0.52122950
3          c        3        3 0.99338019
4          d        4        4 0.71904088
5          e        5        5 0.05405722
By selecting the columns first, you make merge work faster, and you don't have to bother about finding the columns you want after the merge.
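Since the question compares this to an Excel VLOOKUP, here is a minimal sketch of the same column selection when some keys have no match. It assumes you want to keep every row of df, which is what the all.x argument of merge is for. The toy data, including the unmatched key 'z', is made up for illustration.

```r
# Toy data, made up for illustration; the key 'z' has no entry in lookup
df <- data.frame(COLUMN_A = 1:3, LOOKUP_COL = c("a", "z", "b"))
lookup <- data.frame(COLUMN_A = c(2, 4),
                     LOOKUP_COL = c("a", "b"),
                     SPARE_COL = c(0.1, 0.2))

# Key column plus the only column we want to bring over
keep <- c("LOOKUP_COL", "SPARE_COL")

# Default merge is an inner join: the row with key 'z' is dropped
inner <- merge(df, lookup[keep], by = "LOOKUP_COL")

# all.x = TRUE keeps every row of df and fills SPARE_COL with NA
left <- merge(df, lookup[keep], by = "LOOKUP_COL", all.x = TRUE)

print(inner)
print(left)
```

Because COLUMN_A was left out of the selection, no COLUMN_A.x / COLUMN_A.y suffixes appear in either result.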
If speed is an issue and the merge is simple, you can speed things up by manually doing the merge using indices:
id <- match(df$LOOKUP_COL, lookup$LOOKUP_COL)
keep <- c('SPARE_COL')
results <- df
results[keep] <- lookup[id,keep, drop = FALSE]
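For this example, this gives the same values as merge and a good speedup. Unlike merge with its default settings, it keeps the row order of df, and any row of df without a match in lookup gets NA instead of being dropped.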
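To make the index approach concrete, here is a small sketch that also brings over the lookup column whose name clashes with one in df. The fixed SPARE_COL values, the reversed lookup order, and the new column name LOOKUP_A are assumptions chosen for illustration only.

```r
df <- data.frame(COLUMN_A = 1:5, COLUMN_B = 1:5, LOOKUP_COL = letters[1:5])
lookup <- data.frame(COLUMN_A = 2 * (1:5),
                     LOOKUP_COL = letters[5:1],  # deliberately in reverse order
                     SPARE_COL = c(0.5, 0.4, 0.3, 0.2, 0.1))

# For each row of df, the row number of the matching key in lookup
id <- match(df$LOOKUP_COL, lookup$LOOKUP_COL)

results <- df
results$SPARE_COL <- lookup$SPARE_COL[id]

# COLUMN_A exists in both data frames, so store the lookup version
# under a different (hypothetical) name to avoid overwriting df$COLUMN_A
results$LOOKUP_A <- lookup$COLUMN_A[id]

# Cross-check against merge() on the selected columns
merged <- merge(df, lookup[c("LOOKUP_COL", "SPARE_COL")], by = "LOOKUP_COL")

print(results)
```

Note that match() returns only the first matching position, so this sketch assumes the keys in lookup are unique, as they would be for a VLOOKUP-style table.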
Let's create an example with 2 lookup columns first:
N <- 10000
COLUMN_A <- 1:N
COLUMN_B <- 1:N
LOOKUP_COL <- sample(letters[3:7], N, replace = TRUE)
LOOKUP_2 <- sample(letters[10:14], N, replace = TRUE)
df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL, LOOKUP_2)
COLUMN_A <- 2*(1:36)
LOOKUP_COL <- rep(letters[1:6], each = 6)
LOOKUP_2 <- rep(letters[10:15], times = 6)
SPARE_COL <- runif(36)
lookup <- data.frame(COLUMN_A,LOOKUP_COL, LOOKUP_2, SPARE_COL)
You can use merge again like this:
keep <- c('LOOKUP_COL','SPARE_COL', 'LOOKUP_2')
results <- merge(df,lookup[keep],by=c('LOOKUP_COL', 'LOOKUP_2'))
And you can use indices again. Before you match, you have to create the interaction between the lookup columns. You can do this using the function interaction() for any number of lookup columns:
lookups <- c('LOOKUP_COL','LOOKUP_2')
id <- match(interaction(df[lookups]),
            interaction(lookup[lookups]))
keep <- c('SPARE_COL')
results <- df
results[keep] <- lookup[id,keep, drop = FALSE]
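A tiny deterministic sketch of the two-column case makes the combined key visible. The data is made up for illustration. The paste()-based variant is an alternative I am adding here, not part of the approach above; it assumes a separator ("\r") that never occurs in the keys, which avoids collisions that the default "." separator of interaction() could in principle produce when keys themselves contain dots.

```r
# Toy data, made up for illustration
df <- data.frame(ID = 1:4,
                 LOOKUP_COL = c("a", "b", "a", "b"),
                 LOOKUP_2 = c("j", "j", "k", "k"))
lookup <- data.frame(LOOKUP_COL = c("a", "a", "b", "b"),
                     LOOKUP_2 = c("j", "k", "j", "k"),
                     SPARE_COL = c(10, 20, 30, 40))

lookups <- c("LOOKUP_COL", "LOOKUP_2")

# interaction() builds one combined key per row, e.g. "a.j"
id <- match(interaction(df[lookups]), interaction(lookup[lookups]))

# Alternative (an assumption, not from the answer): paste the key
# columns together with a separator that cannot appear in the data
key_df <- do.call(paste, c(df[lookups], sep = "\r"))
key_lookup <- do.call(paste, c(lookup[lookups], sep = "\r"))
id2 <- match(key_df, key_lookup)

results <- df
results$SPARE_COL <- lookup$SPARE_COL[id]
print(results)
```

Both variants give the same row indices for this data.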
In the test below, the speedup is about 6-fold for the two-column case:
     test replications elapsed relative user.self sys.self user.child sys.child
1 code1()          100    6.30    6.117      6.30        0         NA        NA
2 code2()          100    1.03    1.000      1.03        0         NA        NA
The code for testing:
N <- 10000
COLUMN_A <- 1:N
COLUMN_B <- 1:N
LOOKUP_COL <- sample(letters[3:7], N, replace = TRUE)
LOOKUP_2 <- sample(letters[10:14], N, replace = TRUE)
df <- data.frame(COLUMN_A,COLUMN_B,LOOKUP_COL, LOOKUP_2)
COLUMN_A <- 2*(1:36)
LOOKUP_COL <- rep(letters[1:6], each = 6)
LOOKUP_2 <- rep(letters[10:15], times = 6)
SPARE_COL <- runif(36)
lookup <- data.frame(COLUMN_A,LOOKUP_COL, LOOKUP_2, SPARE_COL)
# merge() on the selected columns
code1 <- function(){
  keep <- c('LOOKUP_COL','SPARE_COL', 'LOOKUP_2')
  results <- merge(df,lookup[keep],by=c('LOOKUP_COL', 'LOOKUP_2'))
}
# manual merge using indices
code2 <- function(){
  lookups <- c('LOOKUP_COL','LOOKUP_2')
  id <- match(interaction(df[lookups]),
              interaction(lookup[lookups]))
  keep <- c('SPARE_COL')
  results <- df
  results[keep] <- lookup[id,keep, drop = FALSE]
}
library(rbenchmark)
benchmark(code1(),code2())
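If rbenchmark is not installed, a rough check with base R's system.time() is possible. The sketch below also verifies that both approaches agree on the matched rows. The seed and the number of repetitions are arbitrary assumptions, and timings will differ between machines and R versions, so no expected numbers are shown.

```r
set.seed(42)  # arbitrary seed, only to make the sample reproducible
N <- 10000
df <- data.frame(COLUMN_A = 1:N,
                 COLUMN_B = 1:N,
                 LOOKUP_COL = sample(letters[3:7], N, replace = TRUE),
                 LOOKUP_2 = sample(letters[10:14], N, replace = TRUE))
lookup <- data.frame(COLUMN_A = 2 * (1:36),
                     LOOKUP_COL = rep(letters[1:6], each = 6),
                     LOOKUP_2 = rep(letters[10:15], times = 6),
                     SPARE_COL = runif(36))

lookups <- c("LOOKUP_COL", "LOOKUP_2")

# Approach 1: merge() on the selected columns
merged <- merge(df, lookup[c(lookups, "SPARE_COL")], by = lookups)

# Approach 2: manual merge using indices
id <- match(interaction(df[lookups]), interaction(lookup[lookups]))
indexed <- df
indexed$SPARE_COL <- lookup$SPARE_COL[id]

# LOOKUP_COL 'g' does not occur in lookup: merge() drops those rows,
# while the index approach keeps them with NA
matched <- indexed[!is.na(indexed$SPARE_COL), ]

# Compare after sorting by COLUMN_A, which is unique per row of df
merged <- merged[order(merged$COLUMN_A), ]
same <- isTRUE(all.equal(merged$SPARE_COL, matched$SPARE_COL))
print(same)

# Rough timings over 5 repetitions each
t_merge <- system.time(for (i in 1:5)
  merge(df, lookup[c(lookups, "SPARE_COL")], by = lookups))
t_index <- system.time(for (i in 1:5)
  match(interaction(df[lookups]), interaction(lookup[lookups])))
print(t_merge["elapsed"])
print(t_index["elapsed"])
```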