
Group by on XDF file?

Say I have a huge source XDF file generated with RevoScaleR. I want to create a new target XDF by grouping the source entries on columns A, B and C, and computing the sum, min, max, mean and standard deviation of column D.

Assume the target data is also too big to fit into memory. How should I proceed? I could not find much information about group-by operations in the documentation.

If you want to create a new XDF file, I suggest using the RevoPemaR library, which is included with ML Server. It would help if you added a reproducible example, but the answer could look something like this:

library(RevoPemaR)

# Group on A, B, C and summarise D; results are written to output.xdf
byGroupPemaObj <- PemaByGroup()
groupVals <- pemaCompute(
    pemaObj = byGroupPemaObj,
    data = "input.xdf",
    outData = "output.xdf",
    groupByVar = c("A", "B", "C"),
    computeVars = c("D"),
    fnList = list(
        sum  = list(FUN = sum,  x = NULL, na.rm = TRUE),
        min  = list(FUN = min,  x = NULL, na.rm = TRUE),
        max  = list(FUN = max,  x = NULL, na.rm = TRUE),
        mean = list(FUN = mean, x = NULL, na.rm = TRUE),
        sd   = list(FUN = sd,   x = NULL, na.rm = TRUE)
    )
)

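Another option is rxSummary, called once for each variable you want to summarise. The example below summarises D grouped by the factor A only, and writes the by-group results to out.xdf:

rxSummary(D ~ F(A),
    data = "input.xdf",
    byGroupOutFile = "out.xdf",
    summaryStats = c("Mean", "StdDev", "Min", "Max", "Sum")
)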

The dplyrXdf package lets you carry out dplyr operations like this on Xdf files.

library(dplyrXdf)
src <- RxXdfData("src.xdf")
dest <- src %>%
    group_by(A, B, C) %>%
    summarise(sum = sum(D), min = min(D), max = max(D), mean = mean(D), sd = sd(D))
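
For intuition about why a group-by like this can run on data larger than memory, here is a minimal base-R sketch of the general chunk-wise idea: compute mergeable partial statistics per block, then combine them. It is not how RevoPemaR, rxSummary or dplyrXdf are implemented. The function names (partial_stats, combine_stats), the toy data frame and the in-memory "chunks" are all hypothetical stand-ins for blocks read from an XDF file. It also assumes D has no missing values and that the per-group accumulator table fits in memory.

```r
# Partial statistics for one chunk: one row per (A, B, C) group.
# n, total, sumsq, lo, hi can all be merged across chunks later.
partial_stats <- function(chunk) {
  key <- interaction(chunk$A, chunk$B, chunk$C, drop = TRUE)
  parts <- lapply(split(chunk, key), function(g) {
    data.frame(A = g$A[1], B = g$B[1], C = g$C[1],
               n = nrow(g), total = sum(g$D), sumsq = sum(g$D^2),
               lo = min(g$D), hi = max(g$D))
  })
  do.call(rbind, parts)
}

# Merge rows that belong to the same group (possibly from different chunks).
combine_stats <- function(p) {
  key <- interaction(p$A, p$B, p$C, drop = TRUE)
  parts <- lapply(split(p, key), function(g) {
    data.frame(A = g$A[1], B = g$B[1], C = g$C[1],
               n = sum(g$n), total = sum(g$total), sumsq = sum(g$sumsq),
               lo = min(g$lo), hi = max(g$hi))
  })
  do.call(rbind, parts)
}

# Toy data, split into 4 blocks to imitate reading a large file in pieces.
set.seed(1)
df <- data.frame(A = sample(c("x", "y"), 1000, TRUE),
                 B = sample(1:2, 1000, TRUE),
                 C = sample(c("p", "q"), 1000, TRUE),
                 D = rnorm(1000))
chunks <- split(df, rep(1:4, each = 250))

# Only one chunk plus the accumulator is needed at any time.
acc <- NULL
for (chunk in chunks) {
  acc <- combine_stats(rbind(acc, partial_stats(chunk)))
}

# Derive mean and sample standard deviation from the merged totals.
# (The sum-of-squares formula is simple but can lose precision on
# data with a large mean relative to its spread.)
result <- transform(acc,
                    mean = total / n,
                    sd = sqrt((sumsq - total^2 / n) / (n - 1)))
print(result)
```

If the number of distinct (A, B, C) combinations is itself too large for memory, the accumulator would have to be written out and merged on disk, which is the part the packages above handle for you.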
