Using plyr to join two massive dataframes on two columns

I have a very large dataframe that I need to join to another dataframe on two columns. I've been using merge to accomplish it, but R runs out of memory as the tables get larger. Is there a similar solution using dplyr or plyr? I hear they require substantially less memory. I know how to use the join function in plyr generally; what I am struggling with is joining by two columns. The merge syntax I've been using is below:

Correlation_Table <- merge(Correlation_Table, inter, by.x = c(1,2), by.y = c(1,2), all.x = TRUE, all.y = TRUE)

So, for example, if I have the following two dataframes:

> head(df1)
  x y         z          a
1 1 2 429.57410  43.746670
2 2 3 717.98184 524.288886
3 3 4 601.66938 640.245469
4 4 5  87.41476 318.964765
5 5 6 586.22234 196.759991
6 6 7 619.82194   3.308136
> head(df2)
   b  c        d
1  5  8 152.2855
2  6  9 191.5406
3  7 10 197.0520
4  8 11 175.4209
5  9 12 157.6239
6 10 13 136.3286

Columns x and y of df1 are dimensions, columns b and c of df2 are also dimensions, and the other columns are measures. My goal is to create a new dataframe of all three measures where records of df1.x and df1.y match df2.b and df2.c.

Is this possible using plyr?

You can try

library(dplyr)
res1 <- full_join(df1, df2, by=c('x'='b', 'y'='c'))

According to ?full_join

by: a character vector of variables to join by. If 'NULL', the default, 'join' will do a natural join, using all variables with common names across the two tables. A message lists the variables so that you can check they're right. To join by different variables on x and y use a named vector. For example, 'by = c("a" = "b")' will match 'x.a' to 'y.b'.

Then compare the result with the output of merge():

res2 <- merge(df1, df2, by.x = c(1,2), by.y = c(1,2),
              all.x = TRUE, all.y = TRUE)
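One way to check that the two results agree is to sort both on the key columns before comparing them. The following is a minimal sketch, not a definitive test: the sample data is made up (it is not the poster's tables), `sort_by_keys` is a hypothetical helper name, and dplyr is assumed to be installed.

```r
# Hypothetical sample data (not the poster's tables): two key pairs overlap.
df1 <- data.frame(x = c(1, 2, 3, 4), y = c(2, 3, 4, 5),
                  z = c(10, 20, 30, 40), a = c(0.1, 0.2, 0.3, 0.4))
df2 <- data.frame(b = c(3, 4, 5), c = c(4, 5, 6), d = c(100, 200, 300))

# Full outer join on two key columns, once with dplyr and once with merge().
res1 <- dplyr::full_join(df1, df2, by = c("x" = "b", "y" = "c"))
res2 <- merge(df1, df2, by.x = c(1, 2), by.y = c(1, 2), all = TRUE)

# Hypothetical helper: sort on the key columns and reset the row names,
# so that only the contents are compared and not the row order.
sort_by_keys <- function(df) {
  out <- as.data.frame(df[order(df$x, df$y), ])
  rownames(out) <- NULL
  out
}

same <- isTRUE(all.equal(sort_by_keys(res1), sort_by_keys(res2)))
print(same)
```

On the real tables, sorting two full copies costs memory of its own, so a check like this is probably best run on a sample of the data.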

NOTE: The order of rows will be different
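Since the question asks about plyr specifically: plyr's join() takes a plain character vector for by and matches on columns that share a name in both tables, so one option is to rename the key columns of df2 first. This is a minimal sketch under that assumption, using made-up sample data and a hypothetical `df2_renamed` copy; whether it uses less memory than merge() on large tables is something to measure rather than assume.

```r
# Hypothetical sample data (not the poster's tables).
df1 <- data.frame(x = c(1, 2, 3, 4), y = c(2, 3, 4, 5),
                  z = c(10, 20, 30, 40), a = c(0.1, 0.2, 0.3, 0.4))
df2 <- data.frame(b = c(3, 4, 5), c = c(4, 5, 6), d = c(100, 200, 300))

# plyr::join() joins on shared column names, so give df2's keys
# the same names as df1's keys before joining.
df2_renamed <- df2
names(df2_renamed)[names(df2_renamed) == "b"] <- "x"
names(df2_renamed)[names(df2_renamed) == "c"] <- "y"

# type = "full" keeps unmatched rows from both sides,
# comparable to all.x = TRUE, all.y = TRUE in merge().
res_plyr <- plyr::join(df1, df2_renamed, by = c("x", "y"), type = "full")
print(res_plyr)
```

Calling the function as plyr::join() avoids attaching plyr alongside dplyr, which can mask functions that have the same name in both packages.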
