How to specify colClasses when reading a very big csv file into R using read.table.ffdf?

I am trying to read a very big .csv file, around 20 GB in size, using the function read.table.ffdf() in the "ff" package, but I am having trouble specifying the colClasses option that is passed on to read.csv().

I have to specify the colClasses option because some columns in the file are labels stored as very long integers, e.g. with 11 digits. For example, two rows in the file are:

86246,205,17,1719,104116343,8435,2013-03-13,12,OZ,1,2.59
86246,205,17,1719,10800749282,8435,2013-03-13,12,OZ,1,2.59 

The integer 10800749282 is too large for the type "integer" and can only be handled as either "numeric" or "character". But the value 104116343 in the first row is small enough to fit, so by default R treats this column as "integer".
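To make the overflow concrete, here is a minimal base R sketch (it does not use ff) that checks the two labels from the rows above against the range of R's 32-bit integer type. The variable names are illustrative only.

```r
# Largest value R's 32-bit "integer" type can hold (2147483647).
int_max <- .Machine$integer.max

big_label   <- "10800749282"  # 11-digit label from the second row
small_label <- "104116343"    # 9-digit label from the first row

# Coercing the 11-digit label to integer overflows and yields NA
# (R also emits a warning, silenced here).
big_as_int <- suppressWarnings(as.integer(big_label))

# The same label is stored without loss as a double ("numeric"),
# since doubles represent whole numbers exactly up to 2^53.
big_as_num <- as.numeric(big_label)

# The 9-digit label fits in an integer, which is why R guesses "integer".
small_as_int <- as.integer(small_label)

print(is.na(big_as_int))
print(big_as_num > int_max && big_as_num < 2^53)
print(format(big_as_num, scientific = FALSE))
print(small_as_int <= int_max)
```

Reading the column as "numeric" therefore keeps every digit of an 11-digit label, while "integer" cannot hold it.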

I tried the following but got an error. Does anyone know how to solve this problem? Any help is highly appreciated!

dat <- read.table.ffdf(file="file.csv", FUN = "read.csv", na.strings = "", colClasses="character")

Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'character' not implemented

As the error suggests, ff does not implement a 'character' data type. All character data should be treated as factors. Assuming your file contains x columns, the following works:

dat <- read.csv.ffdf(NULL, file="file.csv", na.strings = "", colClasses=rep("factor", x))

However, you probably do not want to import all the data as factors, as that is extremely inefficient. Just import your numerical data as 'numeric'. Assuming your first 5 columns are numeric and the remaining 3 are character:

dat <- read.csv.ffdf(NULL, file="file.csv", na.strings = "", colClasses=c(rep("numeric", 5), rep("factor", 3)))
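If the column layout is not known in advance, one possible approach is to derive the colClasses vector from a small preview read with base read.csv(). The sketch below is an illustration under assumptions, not part of the original answer: it writes the two sample rows from the question to a temporary file, assumes the file has no header line, and uses made-up names (csv_path, preview_classes, col_classes). Every integer or numeric column is mapped to "numeric" and everything else to "factor".

```r
# Hypothetical stand-in for the real file: the two rows from the question.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "86246,205,17,1719,104116343,8435,2013-03-13,12,OZ,1,2.59",
  "86246,205,17,1719,10800749282,8435,2013-03-13,12,OZ,1,2.59"
), csv_path)

# Preview only the first row, so column 5 is guessed as "integer",
# as it would be when the preview contains only small labels.
preview <- read.csv(csv_path, header = FALSE, nrows = 1,
                    stringsAsFactors = FALSE)
preview_classes <- vapply(preview, function(col) class(col)[1], character(1))

# Promote integer columns to "numeric" so long labels do not overflow;
# ff has no character vmode, so text columns become "factor".
col_classes <- ifelse(preview_classes %in% c("integer", "numeric"),
                      "numeric", "factor")

# Check the vector with base read.csv() on the full sample.
full <- read.csv(csv_path, header = FALSE, colClasses = col_classes)
print(col_classes)
print(format(full$V5, scientific = FALSE))

# With the ff package loaded, the same vector could be passed on, e.g.:
# dat <- read.csv.ffdf(file = csv_path, header = FALSE, colClasses = col_classes)
```

The date and unit columns end up as factors here. Whether that is appropriate depends on how many distinct values those columns hold in the real file.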
