Reading a large CSV file with missing data using the bigmemory package in R

Question

I am working with a large dataset for my research (4.72 GB), and I discovered the "bigmemory" package in R, which supposedly handles large datasets (up to around 10 GB). However, when I use read.big.matrix to read a CSV file, I get the following error:

> x <- read.big.matrix("x.csv", type = "integer", header=TRUE, backingfile="file.bin", descriptorfile="file.desc")

Error in read.big.matrix("x.csv", type = "integer", header = TRUE,  
: Dimension mismatch between header row and first data row.

I think the issue is that the CSV file is not full, i.e., it is missing values in several cells. I tried removing header = TRUE, but then R aborts and restarts the session.

Does anyone have experience reading large CSV files with missing data using read.big.matrix?

Answer 1

This may not solve your problem directly, but you might find my package filematrix useful. The relevant function is fm.create.from.text.file.

Please let me know if it works for your data file.

Answer 2

Did you check the bigmemory PDF at https://cran.r-project.org/web/packages/bigmemory/bigmemory.pdf ?

It is clearly described right there.

write.big.matrix(x, 'IrisData.txt', col.names=TRUE, row.names=TRUE)
y <- read.big.matrix("IrisData.txt", header=TRUE, has.row.names=TRUE)

# The following would fail with a dimension mismatch:
if (FALSE) y <- read.big.matrix("IrisData.txt", header=TRUE)

Basically, the error means there is a column of row names in the CSV file. If you don't pass has.row.names=TRUE, bigmemory treats the row names as a separate data column, and because the header has no entry for that column, the header and the data rows disagree in width and you get the mismatch.
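One way to check whether this applies to your file is to compare the number of fields in the header with the number in the data rows. The sketch below uses base R's count.fields on a small temporary file; the file and the iris subset are illustrative assumptions, not the asker's data.

```r
# Illustrative data only: a small file written with row names but no
# header entry for them, which is what write.table() does by default.
tmp <- tempfile(fileext = ".csv")
write.table(iris[1:3, 1:4], tmp, sep = ",", quote = FALSE,
            row.names = TRUE, col.names = TRUE)

# Count comma-separated fields on each line of the file.
n_fields <- count.fields(tmp, sep = ",")

header_fields <- n_fields[1]
data_fields   <- n_fields[-1]

# TRUE when every data row has exactly one more field than the header,
# which suggests an unnamed row-name column.
has_rowname_column <- all(data_fields == header_fields + 1)
```

For a multi-gigabyte file you would probably not want to scan every line. Since count.fields also accepts a connection, something like count.fields(textConnection(readLines("x.csv", n = 5)), sep = ",") should be enough to compare the header with the first few data rows.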

I personally found the data.table package more useful for dealing with large datasets, YMMV.
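As a rough sketch of the data.table route, fread reads empty cells in numeric columns as NA. The tiny CSV below is made up for illustration, and the requireNamespace guard is only there so the snippet does nothing if data.table is not installed.

```r
# Illustrative data only: a tiny CSV with two empty cells.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,,3", "4,5,", "7,8,9"), tmp)

if (requireNamespace("data.table", quietly = TRUE)) {
  # Treat empty fields and the literal string "NA" as missing values.
  dt <- data.table::fread(tmp, na.strings = c("", "NA"))

  # Number of missing cells across the whole table.
  n_missing <- sum(is.na(dt))
}
```

Note that fread loads the whole table into RAM, unlike a file-backed big.matrix, so whether it fits a 4.72 GB file depends on the memory available on your machine.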
