
R ffdfdply split issue

I have a problem with the ffdfdply function in R:

a=as.ffdf(data.frame(b=11:20,c=c(4,4,4,4,4,5,5,5,5,5), d=c(1,1,1,0,0,0,1,0,1,1)))

ffdfdply(a, split=a$c, FUN= function(x) {data.frame(cumsum(x$d))}, trace=T)

The output it generates is simply a cumulative sum over all rows; it ignores the split criterion.

I need output like this:

c   cumsum
4    1
4    2
4    3
4    3
4    3
5    0
5    1
5    1
5    2
5    3

Can multiple columns be used for "split"? An example would be much appreciated.

Thanks.


@jwijffels, I tested your solution on another data set:

i=as.ffdf(data.frame(a=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2), b=c(1,4,6,2,5,3,1,4,3,2,8,7,1,3,5,4,2,6,3,1,2), c=c(1,1,1,1,1,1,2,2,2,2,1,1,1,1,1,1,1,1,2,2,2), d=c(1,0,1,1,0,1,0,1,1,0,0,1,1,1,0,0,1,1,1,1,0)))

The output I received is incorrect. I need a cumulative sum of column d, grouped by columns a and c.

The step below is correct and gives the correct result:

idx <- ffdforder(i[c("a","c","b")])
ordered_i <- i[idx, ]
ordered_i$key_a_c <- ikey(ordered_i[c("a", "c")])

But when I try to compute the cumulative sum, I get an incorrect result:

cumsum_i <- ffdfdply(ordered_i, split=as.character(ordered_i$key_a_c), FUN= function(x) {
    ## Data in RAM, on which you can use data.table
    x <- as.data.table(x)
    result <- x[, cumsum_a_c := cumsum(x$d), by = list(key_a_c)]
    as.data.frame(result)
}, trace=T)

Please help. I need to run this set of commands on big data.

The correct usage is:

library(ffbase)      # provides ffdfdply
library(data.table)
a <- as.ffdf(data.frame(b=11:20, c=c(4,4,4,4,4,5,5,5,5,5), d=c(1,1,1,0,0,0,1,0,1,1)))
ffdfdply(a, split=as.character(a$c), FUN=function(x) {
  ## Data in RAM, on which you can use data.table
  x <- as.data.table(x)
  ## A chunk can hold several split levels, so group by c inside FUN as well
  result <- x[, cumsum := cumsum(d), by = list(c)]
  as.data.frame(result)
}, trace=TRUE)

If you want to split by two columns, make a new column that combines both and use it as the split. See ?ikey for creating that column.
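As a rough illustration of the combined-key idea, here is a minimal base R sketch of a per-chunk function that restarts the cumulative sum for every (a, c) pair. It does not use ff or ffbase: a plain data.frame stands in for one chunk already loaded into RAM, interaction() is used in place of ikey, and the names chunk and cumsum_by_a_c are hypothetical.

```r
# Base R only: a plain data.frame stands in for one in-RAM chunk.
chunk <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  c = c(1, 1, 2, 1, 2, 2),
  d = c(1, 0, 1, 1, 1, 1)
)

# Hypothetical helper: cumulative sum of d within each (a, c) pair.
cumsum_by_a_c <- function(x) {
  # interaction() builds a single grouping factor from both columns;
  # drop = TRUE discards combinations that do not occur in the data.
  key <- interaction(x$a, x$c, drop = TRUE)
  # ave() applies cumsum per group and returns values in the original row order.
  x$cumsum_a_c <- ave(x$d, key, FUN = cumsum)
  x
}

out <- cumsum_by_a_c(chunk)
print(out)
```

A function of this shape could presumably be passed as FUN. In a data.table version, the equivalent point is to write cumsum(d) rather than cumsum(x$d) inside j, since x$d refers to the whole chunk's column rather than the rows of the current group.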

The help page is useful here. From ?ffdfdply:

this function does not actually split the data. In order to reduce the number of times data is put into RAM for situations with a lot of split levels, the function extracts groups of split elements which can be put into RAM according to BATCHBYTES.

and:

Please make sure your FUN covers the fact that several split elements can be in one chunk of data on which FUN is applied.

My reading of this is that the function you pass to ffdfdply must itself do a split-apply-combine over the groups, because one chunk can contain several of them. For example, using ave:

a$c <- with(a, as.integer(c))
ffdfdply(
    a,
    split=a$c,
    FUN=function(x) data.frame(c=x$c, cumsum=ave(x$d, x$c, FUN=cumsum)),
    trace=TRUE
)

Result:

   c cumsum
1  4      1
2  4      2
3  4      3
4  4      3
5  4      3
6  5      0
7  5      1
8  5      1
9  5      2
10 5      3
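To see why the grouping has to happen inside FUN, the sketch below compares a plain cumsum with a grouped one on a single chunk that holds both values of c. It is base R only and is not run through ffdfdply; the data.frame chunk is a hypothetical stand-in for what FUN would receive when both split levels fit into one batch.

```r
# One chunk holding both split levels, as ?ffdfdply warns can happen.
chunk <- data.frame(
  c = c(4, 4, 4, 4, 4, 5, 5, 5, 5, 5),
  d = c(1, 1, 1, 0, 0, 0, 1, 0, 1, 1)
)

# Ungrouped: keeps accumulating across the boundary between c == 4 and c == 5.
plain <- cumsum(chunk$d)

# Grouped: restarts at each new value of c.
grouped <- ave(chunk$d, chunk$c, FUN = cumsum)

print(data.frame(c = chunk$c, plain = plain, grouped = grouped))
```

The grouped column reproduces the result above, while the plain column continues counting through the rows where c is 5.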
