R: Is it possible to parallelize / speed-up the reading in of a 20 million plus row CSV into R?
Once the CSV is loaded via read.csv, it's fairly trivial to use multicore, segue etc. to play around with the data in the CSV. Reading it in, however, is quite the time sink.
Realise it's better to use MySQL etc. etc.
Assume the use of an AWS 8xl cluster compute instance running R 2.13.

Specs as follows:
Cluster Compute Eight Extra Large specifications:
88 EC2 Compute Units (2 x eight-core Intel Xeon)
60.5 GB of memory
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
Any thoughts / ideas much appreciated.
Going parallel might not be needed if you use fread in data.table:
library(data.table)
dt <- fread("myFile.csv")
A comment to this question illustrates its power. Also, here's an example from my own experience:
d1 <- fread('Tr1PointData_ByTime_new.csv')
Read 1048575 rows and 5 (of 5) columns from 0.043 GB file in 00:00:09
I was able to read in 1.04 million rows in under 10 s!
What you could do is use scan. Two of its input arguments could prove to be interesting: n and skip. You just open two or more connections to the file and use skip and n to select the part you want to read from the file. There are some caveats, but you could give it a try and see if it gives a boost to your speed.
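A rough sketch of that idea, assuming a comma-separated file named myFile.csv with a header row and a known total row count (file name, worker count, and row count are all assumptions to adjust for your data):

```r
library(parallel)

f       <- "myFile.csv"   # hypothetical file name
n_rows  <- 20e6           # assumed total number of data rows
workers <- 4
chunk   <- ceiling(n_rows / workers)

read_chunk <- function(i) {
  ## Each worker skips the header plus the rows handled by earlier
  ## chunks, then reads at most `chunk` lines. Note scan() has both
  ## `nlines` (line count, used here) and `n` (item count).
  scan(f, what = "character", sep = "\n",
       skip = 1 + (i - 1) * chunk, nlines = chunk, quiet = TRUE)
}

parts <- mclapply(seq_len(workers), read_chunk, mc.cores = workers)
```

Each element of parts still needs parsing (e.g. strsplit on commas), and as noted elsewhere in this thread, this only pays off if the storage can actually serve several readers at once.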
Flash or conventional HD storage? If the latter, then if you don't know where the file is on the drives, and how it's split, it's very hard to speed things up, because multiple simultaneous reads will not be faster than one streamed read. It's because of the disk, not the CPU. There's no way to parallelize this without starting at the storage level of the file.
If it's flash storage then a solution like Paul Hiemstra's might help, since good flash storage can have excellent random-read performance, close to sequential. Try it... but if it's not helping, you'll know why.
Also... a fast storage interface doesn't necessarily mean the drives can saturate it. Have you run performance testing on the drives to see how fast they really are?
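A crude way to check raw sequential-read speed from within R (the file name and block size are arbitrary choices; beware that the OS page cache can inflate the number on a second run):

```r
f  <- "myFile.csv"                 # hypothetical file to benchmark
sz <- file.info(f)$size
t  <- system.time({
  con <- file(f, "rb")
  ## stream the file in 64 MB blocks until EOF
  while (length(readBin(con, "raw", n = 64 * 1024^2)) > 0) {}
  close(con)
})
cat(sprintf("~%.0f MB/s sequential read\n",
            sz / 1024^2 / t[["elapsed"]]))
```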