Group by on XDF file?

Question

Say I have a huge source XDF file generated with RevoScaleR. I want to create a new target XDF by grouping the source entries on columns A, B, C and computing the sum, min, max, average, and standard deviation of column D.

Let's assume the target data is too big to fit into memory too. How should I proceed? I could not find much information about group by operations in the documentation.

Answer 1

If you want to create a new XDF file, I suggest using the "RevoPemaR" library, which is included with ML Server. It would be nice if you added a reproducible example, but the answer could be something like this:

library(RevoPemaR)

# Group on A, B, C and compute summary statistics of D into a new XDF file
byGroupPemaObj <- PemaByGroup()
groupVals <- pemaCompute(
    pemaObj = byGroupPemaObj,
    data = "input.xdf",
    outData = "output.xdf",
    groupByVar = c("A", "B", "C"),
    computeVars = c("D"),
    fnList = list(
        sum = list(FUN = sum, x = NULL, na.rm = TRUE),
        min = list(FUN = min, x = NULL, na.rm = TRUE),
        max = list(FUN = max, x = NULL, na.rm = TRUE),
        mean = list(FUN = mean, x = NULL, na.rm = TRUE),
        sd = list(FUN = sd, x = NULL, na.rm = TRUE)
    )
)

You also have another option, rxSummary. For each variable:

rxSummary(D ~ F(A),
    data = "input.xdf",
    byGroupOutFile = "out.xdf",
    summaryStats = c("Mean", "StdDev", "Min", "Max", "Sum")
)
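
For intuition about why these five statistics can be computed without loading the whole file, here is a minimal base R sketch of the general chunk-wise technique: keep per-group partial results (count, sum, sum of squares, min, max) for each chunk, then combine them. This is not how RevoPemaR or rxSummary is implemented internally. The helper names chunk_stats and combine_stats are hypothetical, and the in-memory data frame split into pieces only stands in for reading an XDF file block by block.

```r
# Partial statistics for one chunk: one row per (A, B, C) group.
# Assumes the chunk has at least one row and D has no missing values.
chunk_stats <- function(chunk) {
  by <- chunk[c("A", "B", "C")]
  sums <- aggregate(data.frame(n = 1, sum = chunk$D, sumsq = chunk$D^2), by, sum)
  mins <- aggregate(data.frame(min = chunk$D), by, min)
  maxs <- aggregate(data.frame(max = chunk$D), by, max)
  merge(merge(sums, mins), maxs)  # joined on the shared columns A, B, C
}

# Combine the per-chunk partial results into final per-group statistics.
combine_stats <- function(parts) {
  all <- do.call(rbind, parts)
  by <- all[c("A", "B", "C")]
  out <- merge(
    merge(aggregate(all[c("n", "sum", "sumsq")], by, sum),
          aggregate(all["min"], by, min)),
    aggregate(all["max"], by, max)
  )
  out$mean <- out$sum / out$n
  # Sample variance from the sum of squares. Simple, but it can lose
  # precision when the mean is large relative to the spread.
  # Groups with a single row get NaN here.
  out$sd <- sqrt(pmax(0, out$sumsq - out$n * out$mean^2) / (out$n - 1))
  out[c("A", "B", "C", "sum", "min", "max", "mean", "sd")]
}

# Toy data, processed in 10 chunks of 100 rows each.
set.seed(1)
df <- data.frame(
  A = sample(c("x", "y"), 1000, replace = TRUE),
  B = sample(1:2, 1000, replace = TRUE),
  C = sample(c("p", "q"), 1000, replace = TRUE),
  D = rnorm(1000)
)
chunks <- split(df, rep(1:10, each = 100))
result <- combine_stats(lapply(chunks, chunk_stats))
print(result)
```

Only the small per-group table has to stay in memory between chunks. If the number of distinct groups is itself too large for memory, as the question suggests may be the case, the partial results would also need to be written to disk, which is what the outData and byGroupOutFile arguments above are for.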

Answer 2

The dplyrXdf package lets you carry out dplyr operations like this on Xdf files.

library(dplyrXdf)
src <- RxXdfData("src.xdf")
dest <- src %>%
    group_by(A, B, C) %>%
    summarise(sum=sum(D), min=min(D), max=max(D), mean=mean(D), sd=sd(D))

2 answers

solution1
3 2018-06-13 15:13:33

solution2
2 ACCEPTED 2018-06-13 15:27:31
