I have a very large dataframe that I need to join to another dataframe on two columns. I've been using merge to accomplish it, but R runs out of memory as the tables get larger. Is there a similar solution using dplyr or plyr? I hear they require substantially less memory. I know how to use plyr's join function in general; what I'm struggling with is joining by two columns. The merge syntax I've been using is below:
Correlation_Table <- merge(Correlation_Table, inter, by.x = c(1,2), by.y = c(1,2), all.x = TRUE, all.y = TRUE)
So for example if I have the following two dataframes:
> head(df1)
x y z a
1 1 2 429.57410 43.746670
2 2 3 717.98184 524.288886
3 3 4 601.66938 640.245469
4 4 5 87.41476 318.964765
5 5 6 586.22234 196.759991
6 6 7 619.82194 3.308136
> head(df2)
b c d
1 5 8 152.2855
2 6 9 191.5406
3 7 10 197.0520
4 8 11 175.4209
5 9 12 157.6239
6 10 13 136.3286
Columns x and y of df1 are dimensions, columns b and c of df2 are also dimensions, and the remaining columns are measures. My goal is to create a new dataframe of all three measures where df1.x and df1.y match df2.b and df2.c.
Is this possible using plyr?
You can try
library(dplyr)
res1 <- full_join(df1, df2, by=c('x'='b', 'y'='c'))
According to ?full_join
by: a character vector of variables to join by. If 'NULL', the default, 'join' will do a natural join, using all variables with common names across the two tables. A message lists the variables so that you can check they're right. To join by different variables on x and y use a named vector. For example, 'by = c("a" = "b")' will match 'x.a' to 'y.b'.
and compare the results with
res2 <- merge(df1, df2, by.x = c(1,2), by.y = c(1,2),
all.x = TRUE, all.y = TRUE)
NOTE: The order of rows will be different
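To check that the two approaches agree despite the differing row order, you can sort both results by the join keys before comparing. A minimal sketch with made-up toy data shaped like the question's (column names and values are illustrative, not the asker's actual data):

```r
library(dplyr)

# Toy data mirroring the structure in the question
df1 <- data.frame(x = 1:6, y = 2:7, z = runif(6), a = runif(6))
df2 <- data.frame(b = 5:10, c = 8:13, d = runif(6))

# dplyr join on two differently named key columns
res1 <- full_join(df1, df2, by = c('x' = 'b', 'y' = 'c'))

# base R equivalent
res2 <- merge(df1, df2, by.x = c(1, 2), by.y = c(1, 2),
              all.x = TRUE, all.y = TRUE)

# Sort both by the keys, then compare values (ignoring row names)
all.equal(arrange(res1, x, y), arrange(res2, x, y),
          check.attributes = FALSE)
```

Both results carry the key columns under df1's names (x and y), so the column names line up after sorting.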