Importing a large CSV file in R: error in read.csv.ffdf

Question

I want to import a fairly large file (40M rows x 4 columns). I ended up using ffbase, after first trying sqldf.

I tried base::read.csv: it failed. I tried sqldf::sqldf: it failed too, saying it could not allocate any more memory.

I am just trying to replicate the example given in the ffbase vignette.

R) x <- data.frame(log=rep(c(FALSE, TRUE), length.out=26), int=1:26, dbl=1:26 + 0.1,   fac=factor(letters), ord=ordered(LETTERS), dct=Sys.time()+1:26, dat=seq(as.Date("1910/1/1"), length.out=26, by=1))
R) x <- x[c(13:1, 13:1),]
R) csvfile <- tempPathFile(path=getOption("fftempdir"), extension="csv")
R) write.csv(x, file=csvfile, row.names=FALSE)
R) y <- read.csv(file=csvfile, header=TRUE)
R) y
 log int  dbl fac ord                       dct        dat
1  FALSE  13 13.1   m   M 2012-11-26 11:21:29.15763 1910-01-13
2   TRUE  12 12.1   l   L 2012-11-26 11:21:28.15763 1910-01-12
3  FALSE  11 11.1   k   K 2012-11-26 11:21:27.15763 1910-01-11
4   TRUE  10 10.1   j   J 2012-11-26 11:21:26.15763 1910-01-10
...
23  TRUE   4  4.1   d   D 2012-11-26 11:21:20.15763 1910-01-04
24 FALSE   3  3.1   c   C 2012-11-26 11:21:19.15763 1910-01-03
25  TRUE   2  2.1   b   B 2012-11-26 11:21:18.15763 1910-01-02
26 FALSE   1  1.1   a   A 2012-11-26 11:21:17.15763 1910-01-01


# ---- !!!!! HERE !!!! ---- #
R) ffx <- read.csv.ffdf(file=csvfile, header=TRUE)
Erreur dans ff(initdata = initdata, length = length, levels = levels, ordered = ordered,  : vmode 'character' not implemented

I don't understand...

Do you have any insight?

Answer 1

You probably need to pass the colClasses argument, as you would with a normal read.csv:

ffx <- read.csv.ffdf(file=csvfile, header=TRUE, colClasses = c("logical","integer","numeric","factor","factor","POSIXct","Date"))
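The error message in the question suggests that ff has no storage mode for character columns, so text columns have to arrive as factors. When the column types are not known in advance, one possible approach is to read a few rows with base read.csv, record each column's class, and map "character" to "factor" before calling read.csv.ffdf. The following is a minimal sketch under that assumption; the three-column sample frame and the names `sample_rows` and `col_classes` are made up for illustration, and the ff call runs only if the package is installed.

```r
# Small made-up sample written to a temporary CSV file.
csvfile <- tempfile(fileext = ".csv")
x <- data.frame(int = 1:26, dbl = 1:26 + 0.1, fac = letters,
                stringsAsFactors = FALSE)
write.csv(x, file = csvfile, row.names = FALSE)

# Read only a few rows with base R to learn the column types.
sample_rows <- read.csv(csvfile, header = TRUE, nrows = 10,
                        stringsAsFactors = FALSE)
col_classes <- vapply(sample_rows, function(col) class(col)[1], character(1))

# Assumption: ff cannot store character vectors, so read text as factors.
col_classes[col_classes == "character"] <- "factor"
col_classes <- unname(col_classes)
print(col_classes)

# Pass the derived classes to ff, only when the package is available.
if (requireNamespace("ff", quietly = TRUE)) {
  library(ff)
  ffx <- read.csv.ffdf(file = csvfile, header = TRUE,
                       colClasses = col_classes)
  print(dim(ffx))
}
```

Date and date-time columns are not detected by this sampling step; they would still need to be named explicitly ("Date", "POSIXct"), as in the call above.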

Answer 2

Sorry I am late; I had no access to R for the last 3 days. Here is some additional code for read.csv:

  R) setAs("character","myDate", function(from) as.Date(from, format="%d/%m/%y") )
  R) system.time(data <- read.csv(file=filePath, sep=";", stringsAsFactors=TRUE, colClasses=c("factor","factor","numeric","myDate"), nrows=10));

    utilisateur     système      écoulé 
    0               0            0 
  R) system.time(data <- read.csv(file=filePath, sep=";", stringsAsFactors=TRUE, colClasses=c("factor","factor","numeric","myDate")));
    Erreur : impossible d'allouer un vecteur de taille 250.0 Mo
    Timing stopped at: 236.2 4.92 333.3

=> So read.csv can't handle that number of lines.
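The "myDate" entry in colClasses in the transcript above relies on a coercion method registered with setAs: for class names it does not know, read.csv reads the column as text and then converts it with methods::as. Here is a small self-contained sketch of that idiom. The two-row sample file and the names `sample_file` and `sample_data` are invented for illustration, and the setClass call is an addition (not in the original transcript) that declares the placeholder class before the method is registered.

```r
library(methods)

# Declare the placeholder class, then register character -> myDate coercion.
setClass("myDate")
setAs("character", "myDate",
      function(from) as.Date(from, format = "%d/%m/%y"))

# Hypothetical semicolon-separated sample with a dd/mm/yy date column.
sample_file <- tempfile(fileext = ".csv")
writeLines(c("id;value;day",
             "a;1.5;26/11/12",
             "b;2.5;27/11/12"),
           sample_file)

# read.csv applies the registered coercion to the third column.
sample_data <- read.csv(sample_file, sep = ";",
                        colClasses = c("factor", "numeric", "myDate"))
str(sample_data)
```

The resulting column is an ordinary Date vector, so the same colClasses vector can then be tried with other readers that accept it.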

Same test for read.csv.sql, which is a wrapper around sqldf, for only 500 rows.

R) system.time(data <- read.csv.sql(filePath, dbname = tempfile(), header = T, row.names = F, sep=";"));
   utilisateur     système      écoulé 
   0.07            0.00        0.07

BTW, please be advised that the nbrows option is !NOT WORKING!... and that you cannot specify a colClasses argument...

R) system.time(data <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F, sep=";")));
   Erreur : impossible d'allouer un vecteur de taille 500.0 Mo
   Timing stopped at: 366.8 42.45 570.2

For the whole table it crashes... Strange, as it is supposed to be a reference for big data...

And finally, using the ff package, for 50 rows:

R) system.time(data <- read.csv.ffdf(file=filePath, header=TRUE, nrows=50, colClasses=c("factor","factor","numeric","myDate"),sep=";"))
   utilisateur     système      écoulé 
   0.02            0.00         0.03

Please note that head(data) also has a bug: it does not display columns accurately...

And for the whole table... IT WORKS... !fireworks!

R) system.time(data <- read.csv.ffdf(file=filePath, header=TRUE, colClasses=c("factor","factor","numeric","myDate"),sep=";"))
   utilisateur     système      écoulé 
   409.69          14.42        547.75

For a 36M-row table:

R) dim(data)
   [1] 36083010        4

As a consequence, I recommend the ff package for loading big datasets.
