
R: casting large data frames

I am having trouble casting a rather large data frame: I keep running into memory limits. There is probably also a better way to do this, and I am open to suggestions on either front. The problem looks like this:

library(reshape)
dataf <- data.frame( gridID = rep(c(1,2,3,4),4000), montecarlo = rep(1:1000,each=4), number=runif(1600,0,1) )
castData <- cast(dataf, gridID ~ montecarlo, value='number')

This takes an incredibly long time for some of my data sets. Think of a data frame that has 500,000 unique gridID values with 1000 montecarlo simulations for each (500,000,000 rows of data).

I'm getting this error as I write this question: Aggregation requires fun.aggregate: length used as default

However, the code works in my script with no errors or warnings; it just takes a long time on my larger data frames. I am trying to avoid applying an aggregation function (sum, mean, etc.) to the value, since there can only be one value per gridID ~ montecarlo combination, and I figured the extra computation would be a large waste of time.
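On the fun.aggregate message: reshape::cast() prints it when at least one gridID ~ montecarlo combination has more than one row, and it then falls back to counting rows with length. In the example as posted, the three vectors have lengths 16,000, 4,000 and 1,600, so data.frame() recycles the shorter two and each combination ends up appearing four times. The check below is a minimal base-R sketch of that; the names n_dup, n_sims and fixed are introduced here for illustration and are not from the original post.

```r
# Same construction as in the question: the three vectors have lengths
# 16000, 4000 and 1600, so data.frame() recycles the two shorter ones.
dataf <- data.frame(gridID = rep(c(1, 2, 3, 4), 4000),
                    montecarlo = rep(1:1000, each = 4),
                    number = runif(1600, 0, 1))

nrow(dataf)

# Count rows whose gridID/montecarlo pair has already appeared earlier
n_dup <- sum(duplicated(dataf[, c("gridID", "montecarlo")]))
n_dup

# With matching vector lengths, every pair occurs exactly once
n_sims <- 1000L
fixed <- data.frame(gridID = rep(c(1, 2, 3, 4), n_sims),
                    montecarlo = rep(seq_len(n_sims), each = 4),
                    number = runif(4L * n_sims, 0, 1))

# anyDuplicated() returns 0 when no pair is repeated
anyDuplicated(fixed[, c("gridID", "montecarlo")])
```

Running a check like this before casting is a cheap way to confirm the "one value per combination" assumption on real data.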

The newly cast data frame is then multiplied by another data frame in the same format, 500,000 rows of data with 1000 columns (each representing the monte carlo iteration value), and goes through some more processes.
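Since the cast result is only used for element-wise arithmetic against a second table of the same shape, one option is to hold both as numeric matrices rather than data frames. The following is a base-R sketch under assumptions, not the method from the original post: long_to_matrix is a hypothetical helper, and it assumes exactly one row per gridID ~ montecarlo combination. Note that a 500,000 x 1000 matrix of doubles holds 5e8 values at 8 bytes each, roughly 4 GB per matrix, so memory remains a constraint whichever reshaping tool is used.

```r
# Hypothetical helper (not from the original post): build a
# gridID x montecarlo matrix directly from the long table.
# Assumes each gridID/montecarlo pair occurs exactly once.
long_to_matrix <- function(long, value = "number") {
  grids <- sort(unique(long$gridID))
  sims  <- sort(unique(long$montecarlo))
  wide  <- matrix(NA_real_, nrow = length(grids), ncol = length(sims),
                  dimnames = list(gridID = grids, montecarlo = sims))
  # Two-column index matrix: (row position, column position) per long row
  idx <- cbind(match(long$gridID, grids), match(long$montecarlo, sims))
  wide[idx] <- long[[value]]
  wide
}

set.seed(1)
n_sims <- 1000L
long_a <- data.frame(gridID = rep(1:4, n_sims),
                     montecarlo = rep(seq_len(n_sims), each = 4),
                     number = runif(4L * n_sims))
long_b <- transform(long_a, number = runif(4L * n_sims))

mat_a <- long_to_matrix(long_a)
mat_b <- long_to_matrix(long_b)

# Element-wise product; rows are gridIDs, columns are Monte Carlo iterations
product <- mat_a * mat_b
dim(product)
```

Because both matrices are built with the same sorted row and column order, the product lines up cell by cell without any join on gridID.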

Any suggestions for dealing with these large data frames or speeding things up?

As mentioned, using the data.table package will help drastically. The code below generates two data frames with 100,000 runs for each of four grids, then casts them to the wide format using reshape::cast() and data.table::dcast().

library(reshape)
library(data.table)

## Define a number of simulations
N_Sims <- 100000L

## Create a data frame
dataf <- data.frame(gridID = rep(c(1,2,3,4),N_Sims),
                    montecarlo = rep(1:N_Sims,each=4),
                    number=runif(N_Sims*4L,0,1) )

## Cast using reshape::cast()
castData <- reshape::cast(dataf, gridID ~ montecarlo, value='number')

## Create a fresh data frame to use with data.table
DT_dataf <- data.frame(gridID = rep(c(1,2,3,4),N_Sims),
                       montecarlo = rep(1:N_Sims,each=4),
                       number=runif(N_Sims*4L,0,1) )

## Convert to data.table by reference
setDT(DT_dataf)

## Cast using data.table::dcast()
DT_castData <- data.table::dcast(DT_dataf, gridID ~ montecarlo, value.var = 'number')

profvis Results:

Running the code above with profvis shows that using data.table::dcast() takes a fraction of the time used by reshape::cast(), and requires about 1/10th of the memory allocation.
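For a rough comparison without a profiler, the two casts can also be timed with system.time(). This is only a sketch: n_sims is an assumed, much smaller run count than the 100,000 used above so that it finishes quickly, the variable names are illustrative, and the timing part is skipped when either package is not installed. It reports elapsed time only, not memory allocation, and actual timings will vary by machine.

```r
# Assumption: far fewer simulations than in the answer, to keep the run short
n_sims <- 2000L
dataf <- data.frame(gridID = rep(c(1, 2, 3, 4), n_sims),
                    montecarlo = rep(seq_len(n_sims), each = 4),
                    number = runif(4L * n_sims, 0, 1))

# Only run the comparison when both packages are available
have_pkgs <- requireNamespace("reshape", quietly = TRUE) &&
  requireNamespace("data.table", quietly = TRUE)

if (have_pkgs) {
  t_reshape <- system.time(
    wide_reshape <- reshape::cast(dataf, gridID ~ montecarlo, value = "number")
  )

  # as.data.table() makes a copy, so dataf itself is left untouched
  DT <- data.table::as.data.table(dataf)
  t_dt <- system.time(
    wide_dt <- data.table::dcast(DT, gridID ~ montecarlo, value.var = "number")
  )

  # Elapsed wall-clock seconds for each cast
  print(c(reshape = t_reshape[["elapsed"]], data.table = t_dt[["elapsed"]]))
}
```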
