
Accessing a large csv in R: read.table.ffdf slows down

I am relatively new to using R and am attempting to work with data from a large CSV file (~13.2 million lines, ~250 fields per line, ~14 GB total). After searching for fast methods of accessing this data, I came across the ff package and the read.table.ffdf function. I have been using it as follows:

read.table.ffdf(file = "mydata.csv", sep = ",", colClasses = rep("factor", 250), VERBOSE = TRUE)

However, with the VERBOSE setting activated, I noticed that the output below shows each successive block write taking increasingly long.

read.table.ffdf 1..1000 (1000)  csv-read=0.131sec ffdf-write=0.817sec
read.table.ffdf 1001..18260 (17260)  csv-read=2.351sec ffdf-write=24.858sec
read.table.ffdf 18261..35520 (17260)  csv-read=2.093sec ffdf-write=33.838sec
read.table.ffdf 35521..52780 (17260)  csv-read=2.386sec ffdf-write=41.802sec
read.table.ffdf 52781..70040 (17260)  csv-read=2.428sec ffdf-write=43.642sec
read.table.ffdf 70041..87300 (17260)  csv-read=2.336sec ffdf-write=44.414sec
read.table.ffdf 87301..104560 (17260)  csv-read=2.43sec ffdf-write=52.509sec
read.table.ffdf 104561..121820 (17260)  csv-read=2.15sec ffdf-write=57.926sec
read.table.ffdf 121821..139080 (17260)  csv-read=2.329sec ffdf-write=58.46sec
read.table.ffdf 139081..156340 (17260)  csv-read=2.412sec ffdf-write=63.759sec
read.table.ffdf 156341..173600 (17260)  csv-read=2.344sec ffdf-write=67.341sec
read.table.ffdf 173601..190860 (17260)  csv-read=2.383sec ffdf-write=70.157sec
read.table.ffdf 190861..208120 (17260)  csv-read=2.538sec ffdf-write=75.463sec
read.table.ffdf 208121..225380 (17260)  csv-read=2.395sec ffdf-write=109.761sec
read.table.ffdf 225381..242640 (17260)  csv-read=2.824sec ffdf-write=131.764sec
read.table.ffdf 242641..259900 (17260)  csv-read=2.714sec ffdf-write=116.166sec
read.table.ffdf 259901..277160 (17260)  csv-read=2.277sec ffdf-write=97.019sec
read.table.ffdf 277161..294420 (17260)  csv-read=2.388sec ffdf-write=158.784sec

My understanding was that ff would avoid the slowdown that comes from exhausting available RAM by storing the data frame in files on disk. It should take a similar amount of time to write each block, right? Is there something I have done incorrectly, or is there a better approach to what I wish to accomplish?
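For reference, the block sizes that appear in the VERBOSE output can be set explicitly via the first.rows and next.rows arguments of read.table.ffdf. The sketch below only illustrates those arguments; the chunk sizes shown are arbitrary examples, not recommended values.

```r
library(ff)

# Minimal sketch: the same call with the block sizes made explicit.
# first.rows sets the size of the initial chunk (1000 rows in the log above);
# next.rows sets the size of every subsequent chunk. Values are illustrative.
x <- read.table.ffdf(file = "mydata.csv",
                     sep = ",",
                     colClasses = rep("factor", 250),
                     first.rows = 1000,
                     next.rows = 50000,
                     VERBOSE = TRUE)
```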

Thanks in advance for any insights you might have to offer!

Have you tried the fread function from the data.table package? I load files of that size frequently, and although it takes some time, it is robust and much, much faster than base R. Give it a go:

library(data.table)

# Reads the entire CSV into a data.table in memory
X <- fread("mydata.csv")
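If the columns really do need to be factors, as in the original read.table.ffdf call, fread can convert character columns to factors on read. This is a hedged sketch, not part of the original answer, and it assumes a reasonably recent data.table version where the stringsAsFactors argument is available:

```r
library(data.table)

# Sketch: read the CSV and convert character columns to factors on read.
# stringsAsFactors is a standard fread argument in recent data.table versions.
X <- fread("mydata.csv", stringsAsFactors = TRUE)
```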
