Memory problem when using bigmemory to load a large dataset in R

Question

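I have a large text file (> 10 million rows, > 1 GB) that I want to process one line at a time, to avoid loading the whole thing into memory. After processing each line, I want to save some variables into a big.matrix object. Here is a simplified example: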

library(bigmemory)
library(pryr)

con  <- file('x.csv', open = "r")
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')

for (i in 1:5){
   print(c(address(x), refs(x)))
   y <- readLines(con, n = 1, warn = FALSE)
   x[i] <- 2L*as.integer(y)
} 

close(con)

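where x.csv contains a single integer on each line.

Following the advice at http://adv-r.had.co.nz/memory.html, I printed the memory address of my big.matrix object, and it appears to change with every loop iteration: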

[1] "0x101e854d8" "2"          
[1] "0x101d8f750" "2"          
[1] "0x102380d80" "2"          
[1] "0x105a8ff20" "2"          
[1] "0x105ae0d88" "2"

Can big.matrix objects be modified in place?
Is there a better way to load, process and then save these data? The current method is slow!

Answer 1

"Is there a better way to load, process and then save these data? The current method is slow!"

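The slowest part of your method is the separate call made to read each individual line. We can "chunk" the data instead, reading many lines at a time: each chunk stays small enough to avoid hitting the memory limit, while the number of read calls drops, which will likely speed things up.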
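The chunking idea can be sketched with base R alone, since readLines() takes an `n` argument and can return many lines per call. This is only a minimal illustration: the input file, its contents, `chunk_size`, and the variable names are all made up here and are not part of the original question.

```r
# Hypothetical input: ten integers, one per line (made up for this sketch)
path <- tempfile(fileext = ".csv")
writeLines(as.character(1:10), path)

chunk_size <- 4L   # deliberately tiny; a real run would use thousands of lines
con <- file(path, open = "r")
doubled <- integer(0)
n_calls <- 0L

repeat {
  # One call returns up to chunk_size lines instead of a single line
  lines <- readLines(con, n = chunk_size, warn = FALSE)
  if (length(lines) == 0) break          # end of file reached
  n_calls <- n_calls + 1L
  # Process the whole chunk with one vectorised operation
  doubled <- c(doubled, 2L * as.integer(lines))
}
close(con)
```

Growing `doubled` with c() keeps the sketch short; for a file of real size the results would be written to preallocated storage or to an output file rather than accumulated like this.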

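Here is the plan:

1. Figure out how many lines the file has
2. Read in a chunk of those lines
3. Perform some operation on that chunk
4. Push the chunk out to a new file, to be saved for later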

library(readr)

# Make a file (100000 rows x 10 columns, named X1..X10)
x <- data.frame(matrix(rnorm(100000 * 10), 100000, 10))
write_csv(x, "./test_set2.csv")

# Create a function to read a variable in file and double it
calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                       read.size = 500000, variable = "X1") {
  # Set up variables
  num.lines <- 0
  lines.per <- NULL
  i <- 0L

  # Gather column names and position of objective column
  connection.names <- file(calc.file, open = "r")
  data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
  close(connection.names)
  col.name <- which(colnames(data.names) == variable)

  # Find length of file by line, recording the size of each chunk
  connection.len <- file(calc.file, open = "r")
  while ((linesread <- length(readLines(connection.len, read.size))) > 0) {
    i <- i + 1L   # increment first: R vectors are 1-indexed
    lines.per[i] <- linesread
    num.lines <- num.lines + linesread
  }
  close(connection.len)

  # The first chunk counted the header line, which is skipped when reading
  lines.per[1] <- lines.per[1] - 1L
  lines.per <- lines.per[lines.per > 0]

  # Make connection for doubling function
  # Loop through file and double the set variables
  connection.double <- file(calc.file, open = "r")
  for (j in seq_along(lines.per)) {
    # The if stops read.table from breaking: only the first chunk skips the header
    # Read in a chunk of the file
    if (j == 1) {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         skip = 1, nrows = lines.per[j], comment.char = "")
    } else {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         nrows = lines.per[j], comment.char = "")
    }

    # Grab the columns we need and double them
    double <- data[,I(col.name)] * 2

    # The first chunk creates the output file; later chunks are appended
    if (j != 1) {
      write_csv(data.frame(double), outputFile, append = TRUE)
    } else {
      write_csv(data.frame(double), outputFile)
    }

    message(paste0("Reading from Chunk: ", j, " of ", length(lines.per)))
  }
  close(connection.double)
}

calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")

So we end up with a .csv file containing the manipulated data. You can change double <- data[,I(col.name)] * 2 to whatever operation you need to perform on each chunk.
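If the per-chunk results should end up in a big.matrix, as in the question, rather than in a .csv file, the same kind of chunked loop can assign a whole block of rows at once. The following is only a sketch under assumptions: the file contents, `n_rows`, `chunk_size`, and `filled` are made up for illustration, and the number of rows is assumed to be known before the loop starts.

```r
library(bigmemory)

# Hypothetical input: one integer per line, no header (values made up)
path <- tempfile(fileext = ".csv")
writeLines(as.character(c(4L, 3L, 2L, 1L, 5L)), path)

n_rows <- 5L       # assumed known in advance
chunk_size <- 2L   # deliberately tiny for illustration

x <- big.matrix(nrow = n_rows, ncol = 1, type = "integer")
con <- file(path, open = "r")
filled <- 0L       # number of rows of x written so far

while (length(lines <- readLines(con, n = chunk_size, warn = FALSE)) > 0) {
  idx <- filled + seq_along(lines)       # rows of x that this chunk fills
  x[idx, 1] <- 2L * as.integer(lines)    # one block assignment per chunk
  filled <- filled + length(lines)
}
close(con)

x[, 1]   # pull the column back into an ordinary R vector to inspect it
```

Assigning a block of rows per chunk avoids one `[<-` call per line, which is the same saving the chunked reads give on the input side.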

Solution 1
2 2015-07-14 17:45:06