Memory problems using bigmemory to load large dataset in R
I have a large text file (>10 million rows, >1 GB) that I wish to process one line at a time to avoid loading the entire thing into memory. After processing each line I wish to save some variables into a big.matrix object. Here is a simplified example:
library(bigmemory)
library(pryr)
con <- file('x.csv', open = "r")
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')
for (i in 1:5) {
  print(c(address(x), refs(x)))
  y <- readLines(con, n = 1, warn = FALSE)
  x[i] <- 2L * as.integer(y)
}
close(con)
where x.csv contains:
4
18
2
14
16
Following the advice here http://adv-r.had.co.nz/memory.html I have printed the memory address of my big.matrix object, and it appears to change with each loop iteration:
[1] "0x101e854d8" "2"
[1] "0x101d8f750" "2"
[1] "0x102380d80" "2"
[1] "0x105a8ff20" "2"
[1] "0x105ae0d88" "2"
Can big.matrix objects be modified in place?
Is there a better way to load, process and then save these data? The current method is slow!
- is there a better way to load, process and then save these data? The current method is slow!
The slowest part of your method appears to be the call to read each line individually. We can 'chunk' the data, reading several lines at a time, in order to stay under the memory limit while possibly speeding things up.
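Applied directly to the big.matrix loop from the question, the chunking idea looks like the sketch below. The tiny 5-row x.csv and a chunk size of 2 are just for illustration; the point is that one vectorised assignment per chunk replaces many single-element reads and writes:

```r
library(bigmemory)

# Recreate the question's x.csv
writeLines(c("4", "18", "2", "14", "16"), "x.csv")

chunk.size <- 2
x <- big.matrix(nrow = 5, ncol = 1, type = "integer")

con <- file("x.csv", open = "r")
i <- 0
while (length(lines <- readLines(con, n = chunk.size)) > 0) {
  vals <- 2L * as.integer(lines)            # process a whole chunk at once
  x[(i + 1):(i + length(vals)), 1] <- vals  # one assignment per chunk
  i <- i + length(vals)
}
close(con)
x[, 1]  # 8 36 4 28 32
```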
Here's the plan:
- Read in a chunk of the file
- Process that chunk (here, doubling a chosen column)
- Push that chunk back into a new file to save for later
library(readr)

# Make a file
x <- data.frame(matrix(rnorm(10000), 100000, 10))
write_csv(x, "./test_set2.csv")

# Create a function to read a variable in a file and double it
calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                       read.size = 500000, variable = "X1") {
  # Set up variables
  num.lines <- 0
  lines.per <- NULL
  i <- 0L

  # Gather column names and position of objective column
  connection.names <- file(calc.file, open = "r+")
  data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
  close(connection.names)
  col.name <- which(colnames(data.names) == variable)

  # Find length of file by counting lines, a chunk at a time
  connection.len <- file(calc.file, open = "r+")
  while ((linesread <- length(readLines(connection.len, read.size))) > 0) {
    i <- i + 1L  # increment before assigning: R vectors are 1-indexed
    lines.per[i] <- linesread
    num.lines <- num.lines + linesread
  }
  close(connection.len)
  # The first chunk's count includes the header line, which read.table skips
  lines.per[1] <- lines.per[1] - 1

  # Loop through the file and double the chosen variable, chunk by chunk
  connection.double <- file(calc.file, open = "r+")
  for (j in 1:length(lines.per)) {
    # skip = 1 on the first chunk stops read.table from breaking on the header
    if (j == 1) {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         skip = 1, nrows = lines.per[j], comment.char = "")
    } else {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         nrows = lines.per[j], comment.char = "")
    }
    # Grab the column we need and double it
    double <- data[, I(col.name)] * 2
    if (j != 1) {
      write_csv(data.frame(double), outputFile, append = TRUE)
    } else {
      write_csv(data.frame(double), outputFile)
    }
    message(paste0("Reading from Chunk: ", j, " of ", length(lines.per)))
  }
  close(connection.double)
}

calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")
So we get back a .csv file with the manipulated data. You can change

double <- data[, I(col.name)] * 2

to whatever thing you need to do to each chunk.
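Since this answer already loads readr, its built-in chunked reader is worth a mention: read_csv_chunked() does the line counting and connection bookkeeping for you. Here is a minimal sketch of the same doubling job (file names follow the example above; the callback receives each chunk plus pos, the starting row of that chunk, so pos == 1 marks the first chunk):

```r
library(readr)

# Make a small test file (same shape as the example above, fewer rows)
x <- data.frame(X1 = rnorm(100), X2 = rnorm(100))
write_csv(x, "./test_set2.csv")

# Double X1 chunk by chunk, appending each result to the output file
double_chunk <- function(chunk, pos) {
  write_csv(data.frame(double = chunk$X1 * 2),
            "./outPut_File.csv", append = (pos != 1))
}
read_csv_chunked("./test_set2.csv",
                 callback = SideEffectChunkCallback$new(double_chunk),
                 chunk_size = 30)
```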