Why does summarise in ffbase2 (dplyr for ffdf) fail with "Error in as.vmode.default(): (list) object cannot be coerced to type 'double'"?
I have a large (23 million rows) ffdf table (tbl_ffdf) with 10 columns: 7 of them are factors and 3 contain numbers. It looks something like this:
TABLE_bad
F1 F2 F3 F4 F5 F6 F7 N1 N2 N3
1111 01.15 05.14 busns AA 16 F 55.2 16165 0
1111 01.15 05.14 busns AA 16 F 12.5 0 4545
2222 12.14 11.14 privt KM 5 T 0.7 255 987777
2222 12.14 11.14 privt KM 5 T 111.6 7800 0
I'd like to aggregate the data with sum(Nx) to collapse these duplicates, so that my table looks like this:
TABLE_ok
F1 F2 F3 F4 F5 F6 F7 N1 N2 N3
1111 01.15 05.14 busns AA 16 F 67.7 16165 4545
2222 12.14 11.14 privt KM 5 T 112.3 8055 987777
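For reference, the intended aggregation can be sketched on a small in-memory data.frame with base R. This is only an illustration of the desired result, assuming the four sample rows shown in TABLE_bad; it uses `aggregate()` rather than ffbase2, so it says nothing about how the ffdf version behaves.

```r
# Toy in-memory copy of the sample rows (not an ffdf object).
TABLE_bad <- data.frame(
  F1 = c("1111", "1111", "2222", "2222"),
  F2 = c("01.15", "01.15", "12.14", "12.14"),
  F3 = c("05.14", "05.14", "11.14", "11.14"),
  F4 = c("busns", "busns", "privt", "privt"),
  F5 = c("AA", "AA", "KM", "KM"),
  F6 = c("16", "16", "5", "5"),
  F7 = c("F", "F", "T", "T"),
  N1 = c(55.2, 12.5, 0.7, 111.6),
  N2 = c(16165, 0, 255, 7800),
  N3 = c(0, 4545, 987777, 0),
  stringsAsFactors = TRUE
)

# Sum the numeric columns within each combination of the seven factors.
TABLE_ok <- aggregate(
  cbind(N1, N2, N3) ~ F1 + F2 + F3 + F4 + F5 + F6 + F7,
  data = TABLE_bad,
  FUN = sum
)
print(TABLE_ok)
```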
I'm using the ffbase2 package installed from GitHub (which is dplyr for ffdf tables). I'm doing the following:
TABLE_gr <- group_by(TABLE_bad, F1, F2, F3, F4, F5, F6, F7) # this step finishes OK
# in approximately 90 sec
TABLE_ok <- summarise(TABLE_gr, sN1 = sum(N1), sN2 = sum(N2), sN3 = sum(N3))
After that it runs for about 10 seconds and says:
Error in as.vmode.default(value, vmode) :
(list) object cannot be coerced to type 'double'
After that it goes into debug mode, according to the settings in my RStudio, and it takes about 3-5 MINUTES to go deep enough, stop hanging the computer, and show the code of the function that raised the error:
function (x, ...)
UseMethod("as.vmode")
Here, in Data, we can see that x is a data.frame of F1 values. And in Traceback, the functions are:
eval(expr, envir, enclose)
`[<-`(`*tmp*`, ff::hi(N + 1, N + n), , value = -*etc*-
append_to(out, res, -*etc*-
summarise_.grouped_ffdf( -*etc*-
Looking into the source code of ffbase2 didn't tell me much... As far as I can tell, the method summarise_.grouped_ffdf slices the data recursively and, probably, at the last step it gets a data.frame where it expected a matrix? That is a usual cause of the "(list) object cannot be coerced to type 'double'" error.
I have no idea what the real cause of this error is or how to fix it. Help please! :-)
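The general class of error can be reproduced in base R without ff or ffbase2. The sketch below only shows that coercing a multi-row data.frame (which is a list of columns) to double fails; it is an assumption that the ff internals hit the same coercion, and the exact wording of the message may vary between R versions.

```r
# A data.frame is a list of columns; a column of length > 1 cannot be
# collapsed into a single double, so the coercion fails.
x <- data.frame(F1 = c(1111, 2222))

result <- tryCatch(
  as.double(x),
  error = function(e) e
)

# Print the captured error message instead of stopping the script.
print(conditionMessage(result))
```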
Today I found the cause of the error. The relevant part of the source code of summarise_.grouped_ffdf looks like this:
42 for (i in grouped_chunks(.data)){
43 ch <- grouped_df(data_s[i,,drop=FALSE], groups(.data))
44 res <- summarise_(ch, .dots = dots)
45 out <- append_to(out, res, check_structure=FALSE)
46 }
This function cuts the data into pieces according to groups (line 43) and applies the usual dplyr summarise to them (line 44). Then it appends the result to the output variable. But the source of append_to shows that, for the append to work correctly, the variable res must be a tbl_ffdf object, whereas here it is a plain data.frame. So modifying line 45 of the file manip-grouped-ffdf.r in the following way completely solves the problem:
45 out <- append_to(out, tbl_ffdf(res), check_structure=FALSE)
That's very nice, but after that I ran into out-of-memory problems when using this summarise. Investigation showed that the cause is grouped_chunks(.data). I didn't dig into why that is or what to do about it; I just sliced my data month by month in a for loop and appended the aggregated chunks to each other afterwards.
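The month-by-month workaround can be sketched roughly as follows. This is a toy version on an in-memory data.frame using base R `aggregate()`, not the actual ffdf code; it assumes that F2 is the month column and that it is one of the grouping keys, so no group spans two months and the per-month results can simply be stacked.

```r
# Toy stand-in for the large table; F2 is ASSUMED to be the month key.
df <- data.frame(
  F1 = c("1111", "1111", "2222", "2222"),
  F2 = c("01.15", "01.15", "12.14", "12.14"),
  N1 = c(55.2, 12.5, 0.7, 111.6),
  N2 = c(16165, 0, 255, 7800),
  stringsAsFactors = FALSE
)

pieces <- list()
for (m in unique(df$F2)) {
  # Only one month's rows are aggregated at a time.
  chunk <- df[df$F2 == m, , drop = FALSE]
  pieces[[m]] <- aggregate(cbind(N1, N2) ~ F1 + F2, data = chunk, FUN = sum)
}

# Stack the per-month aggregates into one result.
by_month <- do.call(rbind, pieces)
rownames(by_month) <- NULL
print(by_month)
```

Because the month is part of the grouping key, stacking the per-month results gives the same rows as aggregating the whole table at once, while only one month needs to be processed at a time.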