Using read.csv.ffdf() throws an error

I'm trying to read a large dataset (3.7 million rows, 180 columns) into R using the ff package. The dataset contains several data types: factor, logical, and numeric.

The problem occurs when reading numeric variables. For example, one of my columns is:

TotalBeforeTax
126.9
88.0
124.5
90.9
...

When I try to read the data in, the following error is thrown:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '"126.90000"'

I tried declaring the column's class as integer (it was already declared as numeric) using the colClasses argument, but to no avail. I also tried changing it to a real (whatever that is supposed to mean); it starts reading the data, but at some point throws:

Error in methods::as(data[[i]], colClasses[i]) : 
  no method or default for coercing “character” to “a real”

(My guess is that it comes across an NA and doesn't know what to do with it.)

The funny thing is, if I declare the column as a factor, everything reads in nicely.

What gives?

OK, so I managed to solve this using a primitive workaround. First, split the .csv file using a csv file splitter application. Then, execute the following code:

## Folder that holds the split .csv files, and the common prefix of their names.
## The pieces are expected to be named <prefix>_1.csv, <prefix>_2.csv, ...

sourceDir <- "split_files_folder"
sourceFile <- file.path(sourceDir, "common_name_of_split_files")

## Number of split pieces (placeholder: replace with an integer).

pieces <- "some_number"

## Destination folder and name of the tab-delimited output file.

destDir <- "destination_folder"
destFile <- file.path(destDir, "datafile.txt")

## Read each piece and append it to the output file.
## quote = FALSE writes the values without surrounding quotes.

for (i in seq_len(pieces))
{
  temp <- read.csv(file = paste0(sourceFile, "_", i, ".csv"))
  ## Write the header only with the first piece; append every later piece.
  write.table(temp, file = destFile, append = (i > 1), quote = FALSE, sep = "\t",
              row.names = FALSE, col.names = (i == 1))
}

And voila! You've got a huge tab-delimited text file!
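A likely reason the workaround helps is that read.csv(), when no colClasses is given, reads quoted fields as text and then guesses the type, while write.table(quote = FALSE) writes the numbers back without quotes. The following is only a small base R sketch of that round trip; the sample values and temporary files are made up for illustration, and it assumes the original file wraps its numbers in double quotes, as the error message suggests:

```r
# Made-up sample: numeric values wrapped in double quotes.
src <- tempfile(fileext = ".csv")
writeLines(c("TotalBeforeTax", '"126.90000"', '"88.00000"'), src)

# Without colClasses, read.csv() strips the quotes and then guesses the type.
temp <- read.csv(src)

# quote = FALSE writes the numbers without surrounding quotes.
dest <- tempfile(fileext = ".txt")
write.table(temp, file = dest, quote = FALSE, sep = "\t", row.names = FALSE)

# The unquoted file can be read with the column declared as numeric.
clean <- read.delim(dest, colClasses = "numeric")
str(clean)
```

If this matches the real data, the tab-delimited output should no longer trigger the scan() error when the column is declared numeric.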

Solution 1

You could try laf_to_ffdf from the ffbase package. Something like:

library(LaF)
library(ffbase)

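## col_names and col_types are placeholders that you define yourself:
## col_names is a character vector with the column names,
## col_types is a character vector with the column types
## (LaF uses "double", "integer", "categorical" and "string").
con <- laf_open_csv("yourcsvfile.csv",
  column_names = col_names,
  column_types = col_types,
  dec = ".", sep = ",", skip = 1)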

ffdf <- laf_to_ffdf(con)

Or, if you want to detect the types automatically:

library(LaF)
library(ffbase)

## header = TRUE because the file has a header row
m <- detect_dm_csv("yourcsvfile.csv", header = TRUE)
con <- laf_open(m)
ffdf <- laf_to_ffdf(con)

Solution 2

Use a column class of character for the offending column and cast the column to numeric in the transFUN argument of read.csv.ffdf:

ffdf <- read.csv.ffdf([your regular arguments], transFUN = function(d) {
  d$offendingcolumn <- as.numeric(d$offendingcolumn)
  d
})
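To see what such a transformation does to a single chunk, here is a rough base R sketch that does not need ff. The helper name to_numeric, the Region column, and the sample rows are hypothetical; the function simply has the same shape as the one passed to transFUN above (it takes a data frame and returns it with the column converted):

```r
# Hypothetical helper with the same shape as a transFUN function.
to_numeric <- function(d) {
  d$TotalBeforeTax <- as.numeric(d$TotalBeforeTax)
  d
}

# Made-up sample with quoted numbers and one missing value.
src <- tempfile(fileext = ".csv")
writeLines(c("TotalBeforeTax,Region",
             '"126.90000","north"',
             '"88.00000","south"',
             'NA,"north"'), src)

# Read the offending column as character, the other one as factor.
chunk <- read.csv(src, colClasses = c("character", "factor"))

# Apply the conversion, as transFUN would for every chunk.
chunk <- to_numeric(chunk)
str(chunk)
```

Reading the column as character lets the quotes be stripped on input, so the later as.numeric() call sees plain digits; the NA stays NA.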

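The problem seems to be that the number is surrounded by quotes ("126.9000"). So perhaps you should first read the variable as character, then remove all unwanted characters, and finally convert the variable to numeric.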
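A minimal base R sketch of both points, under the assumption that the quotes really are the culprit. The sample values are made up, and the regular expression is just one possible way to strip unwanted characters (it keeps digits, the decimal point, and signs only, so it would not suit values written in exponent notation):

```r
# Declaring a quoted column as numeric makes the read fail.
src <- tempfile(fileext = ".csv")
writeLines(c("TotalBeforeTax", '"126.90000"'), src)
err <- tryCatch(read.csv(src, colClasses = "numeric"), error = function(e) e)
inherits(err, "error")

# Raw field contents as they might appear in the file (made-up values).
raw <- c('"126.90000"', '"88.00000"', ' "124.5" ', NA)

# Drop everything that cannot be part of a plain decimal number, then convert.
stripped <- gsub("[^0-9.+-]", "", raw)
values <- as.numeric(stripped)
values
```

Missing values pass through both steps unchanged, so an NA in the column should not be a problem with this approach.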
