
Read a 20GB file in chunks without exceeding my RAM - R

I'm currently trying to read a 20GB file. I only need 3 columns of that file. My problem is that I'm limited to 16 GB of RAM. I tried using readr and processing the data in chunks with the function read_csv_chunked, and read_csv with the skip parameter, but both of those exceeded my RAM limits. Even the read_csv(file, ..., skip = 10000000, nrow = 1) call that reads one line uses up all my RAM.

My question now is: how can I read this file? Is there a way to read chunks of the file without using that much RAM?

The LaF package can read in ASCII data in chunks. It can be used directly or, if you are using dplyr, the chunked package uses it to provide an interface for use with dplyr.
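For example, a rough sketch with chunked might look like the following; the file name, the column names col1, col2, col3 and the use of collect() are assumptions here, so check the package documentation for the exact verbs available:

library(chunked)
library(dplyr)

# Read the csv chunkwise, keep only the (assumed) columns of interest,
# and materialise the result in memory.
DF <- read_chunkwise("myfile.csv", chunk_size = 100000) %>%
  select(col1, col2, col3) %>%
  collect()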

The readr package has read_csv_chunked and related functions.
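For instance, a minimal sketch with read_csv_chunked, assuming the three columns are named col1, col2 and col3 (the file name and chunk size are placeholders):

library(readr)

# Callback that keeps only the three columns of each chunk;
# DataFrameCallback row-binds the pieces into one data frame at the end.
keep3 <- function(chunk, pos) chunk[, c("col1", "col2", "col3")]

DF <- read_csv_chunked("myfile.csv",
                       callback = DataFrameCallback$new(keep3),
                       chunk_size = 100000)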

The section of this web page entitled The Loop, as well as subsequent sections of that page, describes how to do chunked reads with base R.

It may be that if you remove all but the first three columns, the data will be small enough to just read in and process in one go.

vroom in the vroom package can read in files very quickly and also has the ability to read in just the columns named in its col_select= argument, which may make the data small enough to read in one go.
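A minimal sketch, assuming the three columns are named col1, col2 and col3:

library(vroom)

# Only the selected columns are materialised, which keeps memory use low.
DF <- vroom("myfile.csv", col_select = c(col1, col2, col3))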

fread in the data.table package is a fast reading function that also supports a select= argument for reading only the specified columns.
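Again a minimal sketch, assuming the same three column names:

library(data.table)

# fread reads only the requested columns and returns a data.table.
DT <- fread("myfile.csv", select = c("col1", "col2", "col3"))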

read.csv.sql in the sqldf package (also see its github page) can read a file larger than R can handle into a temporary external SQLite database, which it creates for you and removes afterwards, and then reads the result of the given SQL statement into R. If the first three columns are named col1, col2 and col3, try the code below. See ?read.csv.sql and ?sqldf for the remaining arguments, which will depend on your file.

library(sqldf)
DF <- read.csv.sql("myfile", "select col1, col2, col3 from file", 
  dbname = tempfile(), ...)

read.table and read.csv in base R have a colClasses= argument which takes a vector of column classes. If the file has nc columns, then use colClasses = rep(c(NA, "NULL"), c(3, nc-3)) to read only the first 3 columns.
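For example, a minimal sketch assuming the file has 10 columns in total:

nc <- 10   # assumed total number of columns in the file
# NA keeps a column with the default class; "NULL" skips it entirely.
DF <- read.csv("myfile.csv", colClasses = rep(c(NA, "NULL"), c(3, nc - 3)))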

Another approach is to pre-process the file outside of R using cut, sed or awk (available natively on UNIX and in the Rtools bin directory on Windows) or any of a number of free command line utilities such as csvfix, removing all but the first three columns, and then see if that makes the file small enough to read in one go.
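For instance, a sketch that shells out to cut from within R, assuming a comma-separated file and that cut is on the PATH (the file names are placeholders):

# Keep only the first three fields and write them to a smaller file.
system("cut -d, -f1-3 myfile.csv > myfile_3cols.csv")
DF <- read.csv("myfile_3cols.csv")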

Also check out the High Performance Computing task view.

We can try something like this; first, a small example csv:

X = data.frame(id=1:1e5, matrix(runif(1e6), ncol=10))
write.csv(X,"test.csv",quote=F,row.names=FALSE)

You can use the nrows argument of read.csv: instead of providing a file name, you provide a connection, and you store the results inside a list, for example:

data = vector("list",200)

con = file("test.csv","r")
data[[1]] = read.csv(con, nrows=1000)
dim(data[[1]])
COLS = colnames(data[[1]])
data[[1]] = data[[1]][,1:3]
head(data[[1]])

  id         X1        X2         X3
1  1 0.13870273 0.4480100 0.41655108
2  2 0.82249489 0.1227274 0.27173937
3  3 0.78684815 0.9125520 0.08783347
4  4 0.23481987 0.7643155 0.59345660
5  5 0.55759721 0.6009626 0.08112619
6  6 0.04274501 0.7234665 0.60290296

In the above, we read the first chunk, collected the column names and subsetted it. If you carry on reading through the connection, the headers will be missing from the remaining chunks, and we need to specify that:

for(i in 2:200){
    data[[i]] = read.csv(con, nrows=1000, col.names=COLS, header=FALSE)[,1:3]
}

Finally, we bind all of those together into a data.frame:

data = do.call(rbind,data)
all.equal(data[,1:3],X[,1:3])
[1] TRUE

You can see that I specified a much larger list than required; this is to show that if you don't know how long the file is, specifying something larger should still work. This is a bit better than writing a while loop.

So we wrap it into a function, specifying the file, the number of rows to read in one go, the number of times to read, and the column names (or positions) to subset:

read_chunkcsv = function(file, rows_to_read, ntimes, col_subset){

    data = vector("list", ntimes)        # one slot per chunk
    con = file(file, "r")
    on.exit(close(con))                  # make sure the connection is closed when we are done

    # first chunk: read the header and remember the column names
    data[[1]] = read.csv(con, nrows = rows_to_read)
    COLS = colnames(data[[1]])
    data[[1]] = data[[1]][, col_subset]

    # remaining chunks: no header left on the connection, so supply the names
    for(i in 2:ntimes){
        data[[i]] = read.csv(con,
            nrows = rows_to_read, col.names = COLS, header = FALSE)[, col_subset]
    }

    return(do.call(rbind, data))
}

all.equal(X[,1:3],
read_chunkcsv("test.csv",rows_to_read=10000,ntimes=10,1:3))
