
Why does R use so much memory when using read.csv()?

I'm running R on linux (kubuntu trusty). I have a csv file that's nearly 400MB, and contains mostly numeric values:

$ ls -lah combined_df.csv 
-rw-rw-r-- 1 naught101 naught101 397M Jun 10 15:25 combined_df.csv

I start R, and df <- read.csv('combined_df.csv') (I get a 1246536x25 dataframe, 3 int columns, 3 logi, 1 factor, and 18 numeric) and then use the script from here to check memory usage:

R> .ls.objects()
         Type  Size    Rows Columns
df data.frame 231.4 1246536      25

Bit odd that it's reporting less memory, but I guess that's just because CSV isn't an efficient storage method for numeric data.
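(A rough sketch of why that can happen: a double occupies 8 bytes in memory, while its full decimal text representation in a CSV can take roughly twice as many characters. The value below is just an arbitrary example.)

x <- 0.123456789012345
object.size(x)                  # 56 bytes on a typical 64-bit build: 8 bytes of data plus R's object header
nchar(format(x, digits = 15))   # 17 characters, i.e. ~17 bytes of CSV text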

But when I check the system memory usage, top says that R is using 20% of my available 8GB of RAM. And ps reports similar:

$ ps aux|grep R
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
naught1+ 32364  5.6 20.4 1738664 1656184 pts/1 S+   09:47   2:42 /usr/lib/R/bin/exec/R

1.7GB of RAM for a 397MB data set. That seems excessive. I know that ps isn't necessarily an accurate way of measuring memory usage, but surely it isn't out by a factor of 5?! Why does R use so much memory?

Also, R seems to report something similar in gc()'s output:

R> gc()
           used  (Mb) gc trigger  (Mb)  max used  (Mb)
Ncells   497414  26.6    9091084 485.6  13354239 713.2
Vcells 36995093 282.3  103130536 786.9 128783476 982.6
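For reference, a quick way to cross-check the footprint of the object itself (as opposed to the whole R process) is object.size(); a minimal sketch using the df created above:

format(object.size(df), units = "Mb")   # in-R size of the data frame alone
gc(reset = TRUE)                        # reset the "max used" columns before measuring again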

As noted in my comment above, there is a section in the documentation for ?read.csv entitled "Memory Usage" that warns that anything based on read.table may use a "surprising" amount of memory, and recommends two things:

  1. Specify the type of each column using the colClasses argument, and
  2. Specify nrows, even as a "mild overestimate" (see the sketch after this list).
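A minimal sketch of applying both hints to the file from the question. The order of the colClasses vector is an assumption based on the column types listed above; adjust it to match the actual file:

# Assumed layout (a guess): 3 integer, 3 logical, 1 factor and 18 numeric columns
cc <- c(rep("integer", 3), rep("logical", 3), "factor", rep("numeric", 18))
df <- read.csv("combined_df.csv",
               colClasses = cc,     # column types supplied up front
               nrows = 1300000)     # mild overestimate of the ~1.25M rows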

Not sure if you just want to know how R works or if you want an alternative to read.csv, but try fread from data.table; it is much faster and I assume it uses much less memory:

library(data.table)
# fread() returns a data.table; wrap it in as.data.frame() if you want a plain data frame
dfr <- as.data.frame(fread("somecsvfile.csv"))
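If you want to skip the conversion step, fread can also return a plain data frame directly via its data.table argument, and it accepts colClasses and nrows hints as well (same hypothetical file name as above):

dfr <- fread("somecsvfile.csv", data.table = FALSE)   # returns a data.frame instead of a data.table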
