
Reading large csv file with missing data using bigmemory package in R

I am using large datasets for my research (4.72 GB), and I discovered the bigmemory package in R, which supposedly handles large datasets (up to around 10 GB). However, when I use read.big.matrix to read a csv file, I get the following error:

> x <- read.big.matrix("x.csv", type = "integer", header=TRUE, backingfile="file.bin", descriptorfile="file.desc")

Error in read.big.matrix("x.csv", type = "integer", header = TRUE,  :
  Dimension mismatch between header row and first data row.

I think the issue is that the csv file is not complete, i.e., it is missing values in several cells. I tried removing header = TRUE, but then R aborts and restarts the session.

Does anyone have experience with reading large csv files with missing data using read.big.matrix?

It may not solve your problem directly, but you might find a package of mine, filematrix, useful. The relevant function is fm.create.from.text.file.

Please let me know if it works for your data file.
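A minimal sketch of what the conversion might look like; the argument names beyond the input file (textfilename, filenamebase, delimiter, omitCharacters, type) are based on my reading of the filematrix documentation and should be checked against ?fm.create.from.text.file:

library(filematrix)

# Convert the csv into a file-backed matrix on disk rather than loading it into RAM.
# The delimiter defaults to tab as far as I recall, so set it explicitly for csv;
# omitCharacters (assumed argument name) controls which token is treated as missing.
fm <- fm.create.from.text.file(
  textfilename   = "x.csv",
  filenamebase   = "x_fm",    # base name for the backing files written to disk
  delimiter      = ",",
  omitCharacters = "NA",
  type           = "integer"
)

dim(fm)        # dimensions, read from the file-backed object
fm[1:5, 1:5]   # inspect a small corner of the matrix
close(fm)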

Did you check the bigmemory PDF at https://cran.r-project.org/web/packages/bigmemory/bigmemory.pdf?

It was clearly described right there.

write.big.matrix(x, 'IrisData.txt', col.names=TRUE, row.names=TRUE)
y <- read.big.matrix("IrisData.txt", header=TRUE, has.row.names=TRUE)

# The following would fail with a dimension mismatch:
if (FALSE) y <- read.big.matrix("IrisData.txt", header=TRUE)

Basically, the error means there is a column in the CSV file that contains row names. If you don't pass has.row.names=TRUE, bigmemory treats the row-name column as a regular data column, so the first data row has one more field than the header row, which produces the dimension mismatch.
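Applied to the call from the question, the fix might look like this (assuming the extra column in x.csv really is row names):

x <- read.big.matrix("x.csv", type = "integer", header = TRUE,
                     has.row.names = TRUE,          # consume the row-name column
                     backingfile = "file.bin",
                     descriptorfile = "file.desc")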

I personally found the data.table package more useful for dealing with large datasets, YMMV.
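For instance, a minimal sketch using data.table::fread, which reads empty cells as NA by default (the na.strings argument shown here just adds extra tokens to treat as missing):

library(data.table)

# fread is fast on multi-GB files and tolerates missing cells.
DT <- fread("x.csv", header = TRUE, na.strings = c("", "NA"))

dim(DT)       # rows and columns read
summary(DT)   # quick check of ranges and NA counts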
