简体   繁体   中英

aggregation using ffdfdply function in R

I tried aggregation on large dataset using 'ffbase' package using ffdfdply function in R.
lets say I have three variables called Date,Item and sales. Here I want to aggregate the sales over Date and Item using sum function. Could you please guide me through some proper syntax in R.
Here I tried like this:

grp_qty <- ffdfdply(x=data[c("sales","Date","Item")], split=as.character(data$sales),FUN = function(data)  

summaryBy(Date+Item~sales, data=data, FUN=sum)).

I would appreciate for your solution.

Mark that ffdfdply is part of ffbase, not ff. To show an example of the usage of ffdfdply, let's generate an ffdf with 50Mio rows.

  require(ffbase)
  data <- expand.ffgrid(Date = ff(seq.Date(Sys.Date(), Sys.Date()+10000, by = "day")), Item = ff(factor(paste("Item", 1:5000))))
  data$sales <- ffrandom(n = nrow(data))
  # split by date -> assuming that all sales of 1 date can fit into RAM
  splitby <- as.character(data$Date, by = 250000)
  grp_qty <- ffdfdply(x=data[c("sales","Date","Item")], 
                      split=splitby, 
                      FUN = function(data){
                        ## This happens in RAM - containing **several** split elements so here we can use data.table which works fine for in RAM computing
                        require(data.table)
                        data <- as.data.table(data)
                        result <- data[, list(sales = sum(sales, na.rm=TRUE)), by = list(Date, Item)]
                        as.data.frame(result)
                      })
  dim(grp_qty)

Mark that grp_qty is an ffdf which resides on disk.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM