
ffdfdply, splitting and memory limit in R

I'm getting an "Error: cannot allocate vector of size ...MB" problem when using ff/ffdf and the ffdfdply function.

I'm trying to use the ff and ffbase packages to process a large amount of data that has been keyed into groups. The data (in ffdf table format) looks like this:

x = 

id_1    id_2    month    year    Amount    key
   1      13        1    2013     -200      11
   1      13        2    2013      300      54
   2      19        1    2013      300      82
   3      33        2    2013      300      70

.... (10+ Million rows)

The unique keys are created using something like:

x$key = as.ff(as.integer(ikey(x[c("id_1","id_2","month","year")])))

To summarise by group using the key variable, I have this command:

summary = ffdfdply(x=x, split=x$key, FUN=function(df) {
  df = data.table(df)
  # one row per key: first id_1 and the sum of the positive amounts
  df = df[, list(id_1 = id_1[1], withdraw = sum(Amount*(Amount>0), na.rm=TRUE)), by = "key"]
  df
}, trace=TRUE)

This uses data.table's excellent grouping feature (idea taken from this discussion). In the real code there are more functions to be applied to the Amount variable, but even with this I cannot process the full ffdf table (a smaller subset of the table works fine).

It seems that ffdfdply is using a huge amount of RAM, giving:

Error: cannot allocate vector of size 64MB

Also, BATCHBYTES does not seem to help. Can anyone with experience of ffdfdply recommend another way to go about this, without pre-splitting the ffdf table into chunks?

The most difficult part of using ff/ffbase is making sure your data stays in ff and is not accidentally pulled into RAM. Once you have put your data in RAM (mostly due to a misunderstanding of when data is put in RAM and when it is not), it is hard to get that RAM back from R, and if you are working at your RAM limit, a small extra request for RAM will give you the 'Error: cannot allocate vector of size' message.
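To make the ff-versus-RAM distinction concrete, here is a minimal sketch of which subsetting forms keep data on disk and which read it into memory. The tiny ffdf below is made-up illustration data, not the table from the question, and the variable names are hypothetical.

```r
library(ff)

# Tiny made-up ffdf standing in for the real table
x <- ffdf(id_1 = ff(1:6), Amount = ff(c(-200, 300, 300, 300, -50, 10)))

on_disk <- x[c("id_1", "Amount")]    # list-style subsetting: still an ffdf
in_ram  <- x[, c("id_1", "Amount")]  # matrix-style subsetting: a data.frame in RAM
col_ff  <- x$Amount                  # a single column as an ff vector, still on disk
col_ram <- x$Amount[]                # empty brackets read the whole vector into RAM

class(on_disk)
class(in_ram)
```

On a 10+ million row table, forms like the second and fourth are the ones worth looking for when memory use grows unexpectedly.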

Now, I think you misspecified the input to ikey. Look at ?ikey: it requires an ffdf as its input argument, not several ff vectors. This has probably put your data in RAM, whereas what you wanted was probably ikey(x[c("id_1","id_2","month","year")]).

I simulated some data on my computer as follows to get an ffdf with 24 million rows, and the following does not give me RAM trouble (it uses approximately 3.5 GB of RAM on my machine):

require(ffbase)
require(data.table)
x <- expand.ffgrid(id_1 = ffseq(1, 1000), id_2 = ffseq(1, 1000), year = as.ff(c(2012,2013)), month = as.ff(1:12))
x$Amount <- ffrandom(nrow(x), rnorm, mean = 10, sd = 5)
x$key <- ikey(x[c("id_1","id_2","month","year")])
x$key <- as.character(x$key)
summary <- ffdfdply(x, split=x$key, FUN=function(df) {
  df <- data.table(df)
  df <- df[, list(
    id_1 = id_1[1], 
    id_2 = id_2[1],
    month = month[1],
    year = year[1],
    withdraw = sum(Amount*(Amount>0), na.rm=T)
  ), by = key]
  df
}, trace=TRUE)

Another reason might be that you have too much other data in RAM that you are not telling us about. Note also that in ff all your factor levels are held in RAM; this might also be an issue if you are working with a lot of character/factor data. In that case you need to ask yourself whether you really need these data in your analysis.
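As a rough way to check the factor-level point, the sketch below builds a small ff factor (made-up values, hypothetical names) and measures the size of its levels, which live in RAM as an ordinary character vector. It assumes the same check would be run on the real factor columns.

```r
library(ff)

# Made-up ff factor; the integer codes are stored on disk
f <- ff(factor(c("a", "b", "a", "c")))

lv <- levels(f)    # the levels are an ordinary in-memory character vector
class(lv)
object.size(lv)    # RAM taken by the levels alone
```

For a column with millions of distinct strings, this number can be a noticeable share of the available memory.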
