
Memory problems using bigmemory to load large dataset in R

I have a large text file (>10 million rows, >1 GB) that I wish to process one line at a time to avoid loading the entire thing into memory. After processing each line I wish to save some variables into a big.matrix object. Here is a simplified example:

library(bigmemory)
library(pryr)

con  <- file('x.csv', open = "r")                       # open a read connection
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')   # pre-allocate a 5 x 1 big.matrix

for (i in 1:5){
   print(c(address(x), refs(x)))              # track the object's address each iteration
   y <- readLines(con, n = 1, warn = FALSE)   # read a single line
   x[i] <- 2L*as.integer(y)                   # double it and store in the big.matrix
}

close(con)

where x.csv contains

4
18
2
14
16

Following the advice at http://adv-r.had.co.nz/memory.html I have printed the memory address of my big.matrix object, and it appears to change with each loop iteration:

[1] "0x101e854d8" "2"          
[1] "0x101d8f750" "2"          
[1] "0x102380d80" "2"          
[1] "0x105a8ff20" "2"          
[1] "0x105ae0d88" "2"   
  1. Can big.matrix objects be modified in place?

  2. Is there a better way to load, process and then save these data? The current method is slow!

  2. Is there a better way to load, process and then save these data? The current method is slow!

The slowest part of your method appears to be the call to read each line individually. We can 'chunk' the data, reading in several lines at a time, in order to stay under the memory limit while possibly speeding things up.
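For example, here is a minimal sketch of that idea applied to the small x.csv and big.matrix from the question (chunk.size = 2 is purely illustrative; for the real file you would read many thousands of lines per chunk):

library(bigmemory)

# Chunked version of the loop from the question: one vectorised
# assignment per chunk instead of one readLines() call per line.
chunk.size <- 2   # illustrative only; use a much larger value in practice
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')
con <- file('x.csv', open = "r")
offset <- 0
repeat {
  y <- readLines(con, n = chunk.size, warn = FALSE)
  if (length(y) == 0) break
  x[(offset + 1):(offset + length(y)), 1] <- 2L * as.integer(y)
  offset <- offset + length(y)
}
close(con)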

Here's the plan:

  1. Figure out how many lines we have in the file
  2. Read in a chunk of those lines
  3. Perform some operation on that chunk
  4. Push that chunk back into a new file to save for later

library(readr)

# Make a file
x <- data.frame(matrix(rnorm(10000), 100000, 10))
write_csv(x, "./test_set2.csv")

# Create a function to read a variable in a file and double it
calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                       read.size = 500000, variable = "X1"){
  # Set up variables
  num.lines <- 0
  lines.per <- NULL
  i <- 0L

  # Gather column names and position of objective column
  connection.names <- file(calc.file, open = "r+")
  data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
  close(connection.names)
  col.name <- which(colnames(data.names) == variable)

  # Find length of file by line, recording how many lines each chunk holds
  connection.len <- file(calc.file, open = "r+")
  while ((linesread <- length(readLines(connection.len, read.size))) > 0){
    i <- i + 1L              # increment first so the first chunk is recorded
    lines.per[i] <- linesread
    num.lines <- num.lines + linesread
  }
  close(connection.len)

  # Loop through the file and double the set variable, chunk by chunk
  connection.double <- file(calc.file, open = "r+")
  for (j in 1:length(lines.per)){
    # Read in a chunk of the file; skip = 1 keeps read.table from
    # treating the header row as data on the first pass
    if (j == 1) {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         skip = 1, nrows = lines.per[j], comment.char = "")
    } else {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         nrows = lines.per[j], comment.char = "")
    }

    # Grab the column we need and double it
    double <- data[, I(col.name)] * 2

    # Write the first chunk fresh, then append the rest
    if (j != 1) {
      write_csv(data.frame(double), outputFile, append = TRUE)
    } else {
      write_csv(data.frame(double), outputFile)
    }
    message(paste0("Reading from Chunk: ", j, " of ", length(lines.per)))
  }
  close(connection.double)
}

calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")
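Note the two passes: the first pass only counts lines per chunk (lines.per), so the second pass knows how many rows each read.table call should request. Since nrows is a maximum rather than an exact count, the final chunk simply reads whatever rows remain once the header has been skipped.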

So we get back a .csv file with the manipulated data. You can change double <- data[, I(col.name)] * 2 to whatever thing you need to do to each chunk.
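For example (purely illustrative alternatives, keeping the same double name so the rest of the function still writes it out):

double <- data[, I(col.name)]^2                    # square the column instead of doubling it
double <- ifelse(data[, I(col.name)] > 0, 1L, 0L)  # or flag positive values

Any per-chunk transformation works, as long as it only needs the rows of the current chunk.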
