
R reading a huge csv

I have a huge CSV file. Its size is around 9 GB. I have 16 GB of RAM. I followed the advice from the page and implemented it below.

If you get the error that R cannot allocate a vector of length x, close out of R and add the following line to the "Target" field: 
--max-vsize=500M 

I am still getting the error and warnings below. How should I read this 9 GB file into R? I have 64-bit R 3.3.1 and I am running the command below in RStudio 0.99.903. I am on Windows Server 2012 R2 Standard, a 64-bit OS.

> memory.limit()
[1] 16383
> answer=read.csv("C:/Users/a-vs/results_20160291.csv")
Error: cannot allocate vector of size 500.0 Mb
In addition: There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
2: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
3: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
4: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
5: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
6: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
7: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
8: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
9: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
10: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
11: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)
12: In scan(file = file, what = what, sep = sep, quote = quote,  ... :
  Reached total allocation of 16383Mb: see help(memory.size)

------------------- Update 1 -------------------

My first try, based on a suggested answer:

> thefile=fread("C:/Users/a-vs/results_20160291.csv", header = T)
Read 44099243 rows and 36 (of 36) columns from 9.399 GB file in 00:13:34
Warning messages:
1: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv",  :
  Reached total allocation of 16383Mb: see help(memory.size)
2: In fread("C:/Users/a-vsingh/results_tendo_20160201_20160215.csv",  :
  Reached total allocation of 16383Mb: see help(memory.size)

------------------- Update 2 -------------------

My second try, based on a suggested answer, is as below:

thefile2 <- read.csv.ffdf(file="C:/Users/a-vs/results_20160291.csv", header=TRUE, VERBOSE=TRUE, 
+                    first.rows=-1, next.rows=50000, colClasses=NA)
read.table.ffdf 1..
Error: cannot allocate vector of size 125.0 Mb
In addition: There were 14 warnings (use warnings() to see them)

How could I read this file into a single object so that I can analyze the entire data in one go?

------------------- Update 3 -------------------

We bought an expensive machine. It has 10 cores and 256 GB of RAM. That is not the most efficient solution, but it works at least for the near future. I looked at the answers below and I don't think they solve my problem :( I appreciate these answers. I want to perform market basket analysis, and I don't think there is any way around keeping my data in RAM.

Make sure you're using 64-bit R, not just 64-bit Windows, so that you can increase your RAM allocation to all 16 GB.
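
A quick way to confirm this from within the R console (a minimal check, not tied to any particular setup):

.Machine$sizeof.pointer   # 8 on a 64-bit build of R, 4 on a 32-bit build
R.version$arch            # typically reports "x86_64" for 64-bit R on Windows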

In addition, you can read the file in chunks:

file_in    <- file("in.csv","r")
chunk_size <- 100000 # choose the best size for you
x          <- readLines(file_in, n=chunk_size)
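
To walk the whole file this way, keep calling readLines on the open connection until it returns nothing, and parse and aggregate each block as it arrives. A rough sketch (the aggregation step is a placeholder you would fill in):

file_in    <- file("in.csv", "r")
chunk_size <- 100000
header     <- readLines(file_in, n = 1)         # keep the header line
repeat {
  lines <- readLines(file_in, n = chunk_size)
  if (length(lines) == 0) break                 # end of file reached
  chunk <- read.csv(text = c(header, lines))    # parse only this block
  # ... summarise `chunk` and keep just the results you need ...
}
close(file_in)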

You can use data.table to handle reading and manipulating large files more efficiently:

require(data.table)
x <- fread("in.csv", header = TRUE)
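
If only some of the columns are needed for the analysis, fread can also be told which ones to keep, which cuts memory use considerably (the column names below are placeholders, not from the original data):

require(data.table)
# read only the columns the analysis needs; `select` accepts names or positions
x <- fread("in.csv", header = TRUE, select = c("COL1", "COL2", "COL3"))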

If needed, you can leverage disk storage with ff:

library("ff")
x <- read.csv.ffdf(file="file.csv", header=TRUE, VERBOSE=TRUE, 
                   first.rows=10000, next.rows=50000, colClasses=NA)
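
Once the data sits in an ffdf on disk, you can pull it back into RAM one slice at a time rather than all at once. A rough sketch, assuming the chunk() helper from ff to generate RAM-sized row ranges:

library(ff)
for (idx in chunk(x)) {
  block <- x[idx, ]   # each slice comes back as an ordinary in-memory data.frame
  # ... compute and store only the summaries you need from `block` ...
}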

You might want to consider leveraging some on-disk processing and not have the entire object in R's memory. One option would be to store the data in a proper database and then have R access that. dplyr is able to deal with a remote source (it actually writes the SQL statements to query the database). I've just tested this with a small example (a mere 17,500 rows), but hopefully it scales up to your requirements.

Install SQLite

https://www.sqlite.org/download.html

Enter the data into a new SQLite database

  • Save the following in a new file named import.sql

CREATE TABLE tableName (COL1, COL2, COL3, COL4);
.separator ,
.import YOURDATA.csv tableName

Yes, you'll need to specify the column names yourself (I believe), but you can also specify their types here if you wish. This won't work if you have commas anywhere in your names/data, of course.

  • Import the data into the SQLite database via the command line

sqlite3.exe BIGDATA.sqlite3 < import.sql

Point dplyr to the SQLite database

As we're using SQLite, all of the dependencies are handled by dplyr already.

library(dplyr)
my_db  <- src_sqlite("/PATH/TO/YOUR/DB/BIGDATA.sqlite3", create = FALSE)
my_tbl <- tbl(my_db, "tableName")

Do your exploratory analysis

dplyr will write the SQLite commands needed to query this data source. It will otherwise behave like a local table. The big exception is that you can't query the number of rows (see the note after the example below).

my_tbl %>% group_by(COL2) %>% summarise(meanVal = mean(COL3))

#>  Source:   query [?? x 2]
#>  Database: sqlite 3.8.6 [/PATH/TO/YOUR/DB/BIGDATA.sqlite3]
#>  
#>         COL2    meanVal
#>        <chr>      <dbl>
#>  1      1979   15.26476
#>  2      1980   16.09677
#>  3      1981   15.83936
#>  4      1982   14.47380
#>  5      1983   15.36479
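
As a side note on the row-count caveat above: nrow() on a remote table comes back as NA, but you can still ask the database to count for you (a small illustrative query, not part of the original example):

nrow(my_tbl)                         # NA for a lazy remote source
my_tbl %>% summarise(n_rows = n())   # translated to COUNT(*) and run in SQLite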

This may not be possible on your computer. In certain cases, a data.table takes up more space than its .csv counterpart.

library(data.table)
DT <- data.table(x = sample(1:2, 10000000, replace = T))
write.csv(DT, "test.csv", row.names = F)  # 29 MB file
DT <- fread("test.csv")
object.size(DT)
> 40001072 bytes  # 40 MB

Two orders of magnitude larger:

DT <- data.table(x = sample(1:2, 1000000000, replace = T))
write.csv(DT, "test.csv", row.names = F)  # 2.92 GB file
DT <- fread("test.csv")
object.size(DT)
> 4000001072 bytes  # 4.00 GB

There is natural overhead to storing an object in R. Based on these numbers, there is roughly a 1.33x factor (R:csv) when reading files; however, this varies based on the data. For example, using

  • x = sample(1:10000000, 10000000, replace = T) gives a factor of roughly 2x (R:csv).

  • x = sample(c("foofoofoo","barbarbar"), 10000000, replace = T) gives a factor of 0.5x (R:csv).

Based on the maximum factor, your 9 GB file would take a potential 18 GB of memory to store in R, if not more. Based on your error message, it is far more likely that you are hitting hard memory constraints than an allocation issue. Therefore, just reading your file in chunks and consolidating it would not work - you would also need to partition your analysis and workflow. Another alternative is to use an out-of-core tool like a SQL database.
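
One way to sanity-check the eventual footprint before committing to a full read is to load a small sample of rows and scale it up (a rough estimate that assumes the rows are reasonably homogeneous; the row total comes from the fread output earlier in the question):

library(data.table)
sample_dt  <- fread("C:/Users/a-vs/results_20160291.csv", nrows = 100000)
rows_total <- 44099243                                    # reported by fread above
est_bytes  <- as.numeric(object.size(sample_dt)) / nrow(sample_dt) * rows_total
est_bytes / 1024^3                                        # estimated in-memory size in GB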

This would be horrible practice, but depending on how you need to process this data, it shouldn't be too bad. You can change the maximum memory that R is allowed to use by calling memory.limit(new), where new is an integer giving R's new memory.limit in MB. What will happen is that when you hit the hardware constraint, Windows will start paging memory onto the hard drive (not the worst thing in the world, but it will severely slow down your processing).

If you are running this on a server version of Windows, paging will possibly (likely) work differently than on regular Windows 10. I believe it should be faster, as the server OS should be optimized for this kind of thing.

Try starting off with something along the lines of 32 GB (or memory.limit(memory.limit()*2)), and if it comes out MUCH larger than that, I would say the program will end up being too slow once it is loaded into memory. At that point I would recommend buying more RAM or finding a way to process the data in parts.
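
For reference, raising the cap looks like this (Windows-only; anything beyond physical RAM will be paged to disk, so expect it to be slow):

memory.limit()                # current limit in MB
memory.limit(size = 32768)    # allow roughly 32 GB; the excess gets paged to disk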

You could try splitting your processing over the table. Instead of operating on the whole thing, put the whole operation inside a for loop and do it 16, 32, 64, or however many times you need to. Any values you need for later computation can be saved. This isn't as fast as the other approaches, but it will definitely return.

x   <- ceiling(number_of_rows_in_file / CHUNK_SIZE)  # total number of chunks
con <- file("in.csv", "r")
for (i in seq_len(x)) {
    # only the first chunk contains the header row
    chunk <- read.csv(con, nrows = CHUNK_SIZE, header = (i == 1))
    # ... process `chunk` and save any values needed for later computation ...
}
close(con)

Hope that helps.
