I am attempting to apply a function to two data sets, df1 and df2, where df1 contains (a, b) and can be 1 million rows long, and df2 contains (x, y, z) and can be very large, anywhere from ~100 to >10,000 rows. I would like to apply a function foo over every combination of rows from both data sets and then sum over the second data set.
foo <- function(a, b, x, y, z) a + b + x + y + z
df1 <- data.frame(a = 1:10, b = 11:20)
df2 <- data.frame(x = 1:5, y = 21:25, z = 31:35)
The code I am using to apply this function (taken from @jlhoward in "How to avoid multiple loops with multiple variables in R") is:
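# Wrapper that unpacks the columns of two data frames and passes them to foo
foo.new <- function(p1, p2) {
  p1 <- as.list(p1)
  p2 <- as.list(p2)
  foo(p1$a, p1$b, p2$x, p2$y, p2$z)
}
# Every (df2 row, df1 row) pair; indx2 varies fastest
indx <- expand.grid(indx2 = seq(nrow(df2)), indx1 = seq(nrow(df1)))
# Evaluate foo on all pairs at once, then sum the results for each df1 row
result <- with(indx, foo.new(df1[indx1, ], df2[indx2, ]))
sums <- aggregate(result, by = list(rep(seq(nrow(df1)), each = nrow(df2))), sum)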
However, as df2 gets large (>1000 rows) I quickly run out of memory when computing result above (running a 64-bit PC with 32 GB RAM).
I have read about data.table quite a bit but can't evaluate whether there is a function in there that would help save memory: something that would replace with and create a smaller object at the result step, or replace expand.grid at the indx step, which creates the largest object by far.
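For a sense of scale, the size of the index built by expand.grid can be estimated with a little arithmetic. The sketch below is only a rough lower bound: grid_index_gb is a hypothetical helper, and it assumes both index columns are stored as 4-byte integers (which is what seq(nrow(...)) produces). It ignores the copies of df1 and df2 made by the row subsetting, which add more on top.

```r
# Hypothetical helper: approximate size in GB of a two-column integer index
# with n1 * n2 rows. Assumes 4 bytes per integer and ignores object overhead.
grid_index_gb <- function(n1, n2, bytes_per_value = 4) {
  n1 * n2 * 2 * bytes_per_value / 1e9
}

gb_1k  <- grid_index_gb(1e6, 1e3)  # df1 with 1 million rows, df2 with 1,000 rows
gb_10k <- grid_index_gb(1e6, 1e4)  # df1 with 1 million rows, df2 with 10,000 rows
gb_1k   # 8
gb_10k  # 80
```

Under these assumptions the index alone takes about 8 GB at 1,000 rows in df2 and about 80 GB at 10,000 rows, before any of the data columns are expanded.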
Here is a data.table solution, which should be pretty fast:
library(data.table)
indx <- CJ(indx1 = seq(nrow(df1)), indx2 = seq(nrow(df2)))  # CJ is data.table's counterpart to expand.grid; its first column varies slowest
indx[, `:=`(result = foo.new(df1[indx1, ], df2[indx2, ]),
            Group.1 = rep(seq(nrow(df1)), each = nrow(df2)))][, .(sums = sum(result)), by = Group.1]
    Group.1 sums
 1:       1  355
 2:       2  365
 3:       3  375
 4:       4  385
 5:       5  395
 6:       6  405
 7:       7  415
 8:       8  425
 9:       9  435
10:      10  445
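If memory rather than speed is the limit, one alternative is to skip the cross join entirely and accumulate the sums one df2 row at a time. The following is a minimal sketch, not part of the data.table answer above: sum_over_df2 is a hypothetical helper, and it assumes foo is vectorised over a and b and accepts single values for x, y and z (true for the example foo, but worth checking for a real one).

```r
foo <- function(a, b, x, y, z) a + b + x + y + z
df1 <- data.frame(a = 1:10, b = 11:20)
df2 <- data.frame(x = 1:5, y = 21:25, z = 31:35)

# Hypothetical helper: for each row of df1, sum f over all rows of df2
# without ever materialising the nrow(df1) * nrow(df2) combinations.
sum_over_df2 <- function(df1, df2, f) {
  total <- numeric(nrow(df1))  # one running sum per row of df1
  for (j in seq_len(nrow(df2))) {
    # f is evaluated on whole df1 columns; the df2 values are scalars
    total <- total + f(df1$a, df1$b, df2$x[j], df2$y[j], df2$z[j])
  }
  total
}

sums <- sum_over_df2(df1, df2, foo)
sums
```

On the example data this gives the same sums as the table above (355 through 445). The working set stays proportional to nrow(df1), at the cost of one pass over df1 for every row of df2.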