ffdfdply function crashes R and is very slow

I am learning how to process large data sets (more than 1 or 2 GB) in R, and I am trying to use the ff package together with the ffdfdply function from ffbase. (See this question on how to use ffdfdply: R language: problems computing "group by" or split with ff package)

My data have the following columns:
"id" "birth_date" "diagnose" "date_diagnose"

There are several rows for each "id", and I want to extract the first date on which there was a diagnosis.

I would apply this:

library(ffbase)
library(plyr)
load(file = file_name)  # loads my ffdf database, called data.f

# For each id, the smallest value of date_diagnose - birth_date
my_fun <- function(x) {
  ddply(x, .(id), summarize,
        age = min(date_diagnose - birth_date, na.rm = TRUE))
}
result <- ffdfdply(x = data.f, split = data.f$id,
                   FUN = my_fun, trace = TRUE)
result[1:10, ]  # to check
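
To make the intended result concrete, here is a minimal in-memory sketch of the same per-id computation using only base R. The toy data frame and its values are hypothetical, invented purely for illustration; it uses an ordinary data.frame rather than an ffdf, so it says nothing about the crash itself.

```r
# Hypothetical toy data with the same column names as the real data set.
toy <- data.frame(
  id            = c(1, 1, 2, 2, 2),
  birth_date    = as.Date(c("1980-01-01", "1980-01-01",
                            "1975-06-15", "1975-06-15", "1975-06-15")),
  date_diagnose = as.Date(c("2010-03-01", "2008-07-20",
                            "2001-01-10", NA, "1999-12-31"))
)

# Age at each diagnosis, in days (Date - Date gives a difftime in days).
toy$age <- as.numeric(toy$date_diagnose - toy$birth_date)

# Smallest age per id; the formula interface drops rows where age is NA.
first_dx <- aggregate(age ~ id, data = toy, FUN = min)
print(first_dx)
```

The result has one row per id, holding the age in days at the earliest diagnosis, which is what my_fun is meant to return for each split.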

Strangely, the call ffdfdply(x = data.f, ...) makes RStudio (and R) crash, but not consistently: the same command sometimes crashes R and sometimes does not. For example, if I run the ffdfdply line again after it worked the first time, R crashes.

Using other functions or other data has the same effect. There is no increase in memory use, and nothing is written to log.txt. I see the same behaviour when applying the summaryBy "technique".

So if anybody has had the same problem and found a solution, that would be very helpful. Also, ffdfdply gets very slow (slower than SAS), and I am considering a different strategy for this kind of task.

Does ffdfdply take into account that the data set is, for example, ordered by id (so that it does not have to scan all the data to collect the rows with the same id)?

So, if anybody knows other approaches to this ddply problem, it would be really great for everyone who works with large data sets in R on little RAM.
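
One possible direction, sketched here only as an illustration: a per-id minimum can be computed chunk by chunk and the partial results merged with a second minimum, so the rows for an id do not need to sit in the same chunk and the data do not need to be sorted. The function name chunked_min_age, the chunk_rows parameter and the toy data are all hypothetical. The sketch works on a plain data.frame and assumes at least one non-missing age; with an ffdf each chunk would first have to be read into RAM. It is not a description of what ffdfdply does internally.

```r
# Hypothetical helper: per-id minimum of (date_diagnose - birth_date),
# computed over row chunks and then merged.
chunked_min_age <- function(df, chunk_rows = 1e5) {
  starts <- seq(1, nrow(df), by = chunk_rows)
  partial <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_rows - 1, nrow(df))
    age  <- as.numeric(df$date_diagnose[rows] - df$birth_date[rows])
    keep <- !is.na(age)
    if (!any(keep)) return(NULL)  # nothing usable in this chunk
    aggregate(data.frame(age = age[keep]),
              by = list(id = df$id[rows][keep]), FUN = min)
  })
  # Stack the per-chunk minima and take the minimum again per id.
  combined <- do.call(rbind, Filter(Negate(is.null), partial))
  aggregate(data.frame(age = combined$age),
            by = list(id = combined$id), FUN = min)
}

# Hypothetical toy data; id 2 is spread over two chunks when chunk_rows = 2.
toy_chunks <- data.frame(
  id            = c(1, 1, 2, 2, 2),
  birth_date    = as.Date(c("1980-01-01", "1980-01-01",
                            "1975-06-15", "1975-06-15", "1975-06-15")),
  date_diagnose = as.Date(c("2010-03-01", "2008-07-20",
                            "2001-01-10", NA, "1999-12-31"))
)
res_chunks <- chunked_min_age(toy_chunks, chunk_rows = 2)
print(res_chunks)
```

This works because the minimum of per-chunk minima equals the overall minimum. The same trick does not carry over directly to summaries such as the median.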

This is my sessionInfo()

R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252   
[3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C                   
[5] LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] plyr_1.7.1   ffbase_0.6-1 ff_2.2-10    bit_1.1-9

I also noticed this when using the package that we recently uploaded to CRAN. It seems to be caused by ffbase overloading the "[.ff" and "[<-.ff" extractor and setter functions from package ff.

I will remove this feature from the package and upload a new version to CRAN soon. In the meantime, you can use version 0.7 of ffbase, which you can get here: http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz

and install it as:

download.file("http://dl.dropbox.com/u/25690064/ffbase_0.7.tar.gz", "ffbase_0.7.tar.gz")
shell("R CMD INSTALL ffbase_0.7.tar.gz")
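
As a hedged alternative to calling R CMD INSTALL through shell(), a downloaded source tarball can also be installed from within R via install.packages() with repos = NULL. The two helper names below (install_local_tarball, has_min_version) are hypothetical, written only for this sketch, and building a source package on Windows may additionally require the usual build tools.

```r
# Hypothetical helper: install a source tarball that is already on disk.
install_local_tarball <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(invisible(FALSE))
  }
  install.packages(path, repos = NULL, type = "source")
  invisible(TRUE)
}

# Hypothetical helper: TRUE if pkg is installed at min_version or newer.
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    packageVersion(pkg) >= min_version
}

# FALSE if ffbase is missing or older than 0.7.
has_min_version("ffbase", "0.7")
```

After downloading the file, install_local_tarball("ffbase_0.7.tar.gz") would perform the install. It is safest to restart the R session afterwards so that the new version is the one that gets loaded.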

Let me know if that helped.

