
Using ffdfdply to split data and get characteristics of each id in the split

Within R I'm using ffdf to work with a large dataset. I want to use ffdfdply from the ffbase package to split the data according to a certain variable (var) and then compute some characteristics for all the observations with a unique value for var (for example: the number of observations for each unique value of var). To see if this is possible using ffdfdply I executed the example described below.

I expected that it would split on each Species, calculate the minimum Petal.Width for each Species, and return two columns with three entries each, listing the Species and the minimum Petal.Width for that Species. Expected output:

  Species    min_pw
1 setosa     0.1       
2 versicolor 1.0       
3 virginica  1.4  

However, with BATCHBYTES = 5000 it uses two splits, one containing two Species and the other containing one Species. This results in the following:

  Species   min_pw
1 setosa    0.1      
2 virginica 1.4    

When I change BATCHBYTES to 2000, this will force ffdfdply to use three splits and thus results in the expected output posted above. However I want to have another way of enforcing a split into each unique value of the variable assigned to 'split'. Is there any way to make this happen? Or do you have any other suggestions to get the result I need?

library(ffbase)  # depends on ff, which provides as.ffdf
ffiris <- as.ffdf(iris)
result <- ffdfdply(x = ffiris,
                   split = ffiris$Species,
                   FUN = function(x) {
                      min_pw <- min(x$Petal.Width)
                      data.frame(Species=x$Species, min_pw= min_pw)
                   },
                   BATCHBYTES = 5000,
                   trace=TRUE
)
dim(result)
dim(iris)
result

The function ffdfdply was designed for cases with a lot of split elements, e.g. when you have 1000000 customers. Each chunk it loads into memory contains all the data for at least one customer, but possibly for more customers if your RAM allows, so that the internals do not need to do an ffwhich 1000000 times. That is why the documentation of ffdfdply states:

'Please make sure your FUN covers the fact that several split elements can be in one chunk of data on which FUN is applied.' So the solution to your issue is to handle this inside FUN, for example as follows:

FUN=function(x){
  require(doBy)
  summaryBy(Petal.Width ~ Species, data=x, keep.names=TRUE, FUN=min)
}
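
To illustrate why grouping inside FUN gives the expected result regardless of how the splits are batched, here is a minimal sketch. It swaps in base R's aggregate for doBy's summaryBy (my substitution, not what the answer above uses), and the names min_pw_by_species, chunk1 and chunk2 are hypothetical. The two chunks only imitate what ffdfdply might pass to FUN with BATCHBYTES = 5000. The last part runs the same FUN through ffdfdply only if ffbase is installed.

```r
# FUN that groups by Species itself, so it does not matter how many
# Species end up in one chunk (hypothetical helper name).
min_pw_by_species <- function(x) {
  out <- aggregate(Petal.Width ~ Species, data = x, FUN = min)
  names(out)[2] <- "min_pw"
  out
}

# Imitate two chunks: one holding two Species, one holding a single Species.
chunk1 <- iris[iris$Species %in% c("setosa", "versicolor"), ]
chunk2 <- iris[iris$Species == "virginica", ]

# Applying FUN per chunk and stacking the pieces yields one row per Species.
result <- rbind(min_pw_by_species(chunk1), min_pw_by_species(chunk2))
print(result)

# The same FUN passed to ffdfdply (assumes the ffbase package is installed).
if (requireNamespace("ffbase", quietly = TRUE)) {
  library(ffbase)
  ffiris <- as.ffdf(iris)
  ff_result <- ffdfdply(x = ffiris,
                        split = ffiris$Species,
                        FUN = min_pw_by_species,
                        BATCHBYTES = 5000,
                        trace = FALSE)
  print(as.data.frame(ff_result))
}
```

Counting the observations per Species works the same way: inside FUN, aggregate on a different statistic (for example FUN = length) instead of min.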
