read.csv2.ffdf is importing a numeric (float) variable as a factor

Question

I have been working with the ff package for a couple of weeks and it has worked great so far, but today I realized that a variable that should be numeric is being read as a factor. The data has about 900k rows and 800 columns, so it's not easy to check that every column gets the class it should.

matff <- read.csv2.ffdf(file = name,encoding = "UTF-8",next.rows=150000,colClasses=NA)

I would like to know why this might be happening and how to fix it.

Thanks.

Answer 1

Your data has some columns that clearly contain text rather than the numeric data you expect.

You can use the transFUN argument of read.csv2.ffdf to solve your decimal problem, as in:

transFUN=function(x){
  x$mycolumn <- as.numeric(gsub(",", ".", as.character(x$mycolumn)))
  x
}

Or pass the appropriate read.table arguments (for example colClasses or dec) to read.csv2.ffdf.
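To illustrate the read.table route, here is a minimal sketch in base R. It uses plain read.csv2 on an in-memory sample rather than read.csv2.ffdf, so it runs without ff; the sample data, the column names and the "n/a" marker are all made up for the example. It shows how a single non-numeric entry turns a whole column into a factor, and how na.strings and colClasses undo that.

```r
# Hypothetical sample in csv2 layout (";" separator, "," decimal mark).
csv_text <- "id;price;label\n1;1,5;a\n2;2,25;b\n3;n/a;c\n"

# Default type guessing: the stray "n/a" makes `price` a factor.
guess <- read.csv2(text = csv_text, stringsAsFactors = TRUE)
class(guess$price)

# Declaring the missing-value marker and the column class
# lets `price` be parsed as numeric.
fixed <- read.csv2(text = csv_text, stringsAsFactors = TRUE,
                   na.strings = c("NA", "n/a"),
                   colClasses = c(price = "numeric"))
class(fixed$price)
```

The same na.strings and colClasses values can, as far as I know, be given to read.csv2.ffdf, which forwards extra arguments to the underlying reader. Treat that as an assumption and check it against your ff version.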

Answer 2

Now it should work:

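# Data example:
# matff <- data.frame(Col=c('a','b','c'), Mix1=c('a','1.2','c'), Mix2=c(1.1,2.1,3), Num1=c('1.2','2.3','3.4'), Num2=c('1,2','2,3','3,4'))

# Convert a factor/character column to numeric when no value contains letters;
# otherwise keep it as a factor. Numeric columns are returned unchanged.
func <- function(x) {
  if (is.factor(x) || is.character(x)) {
    x <- as.character(x)
    if (!any(grepl('[a-zA-Z]', x))) {
      # as.real() is defunct in current R; as.numeric() replaces it
      x <- as.numeric(gsub(',', '.', x, fixed = TRUE))
    } else {
      x <- factor(x)
    }
  }
  x
}

for (i in seq_len(ncol(matff))) {
  matff[, i] <- func(matff[, i])
}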
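As a quick check of the idea, the following sketch applies a per-column conversion to the data example from this answer, using a plain data.frame instead of an ffdf. The helper name to_numeric_if_possible is hypothetical, and it swaps the letter test for a parse check: a column is converted only if every non-missing value parses as a number once "," is replaced by ".".

```r
# Hypothetical helper (not part of ff): convert text-like columns that hold
# only numbers, written with either "," or "." as the decimal mark.
to_numeric_if_possible <- function(x) {
  if (!is.factor(x) && !is.character(x)) return(x)
  txt <- as.character(x)
  num <- suppressWarnings(as.numeric(gsub(",", ".", txt, fixed = TRUE)))
  # Keep the conversion only if every non-missing value parsed as a number.
  if (all(is.na(txt) | !is.na(num))) num else factor(txt)
}

df <- data.frame(Col  = c("a", "b", "c"),
                 Mix1 = c("a", "1.2", "c"),
                 Mix2 = c(1.1, 2.1, 3),
                 Num1 = c("1.2", "2.3", "3.4"),
                 Num2 = c("1,2", "2,3", "3,4"),
                 stringsAsFactors = TRUE)

# Apply the helper to every column while keeping the data.frame structure.
df[] <- lapply(df, to_numeric_if_possible)
sapply(df, class)
```

Col and Mix1 stay factors because they contain non-numeric text, while Num1 and Num2 become numeric. With 800 columns, inspecting sapply(df, class) before and after is a cheap way to see which columns were affected.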

Answer 2 (2013-02-18 13:18:23)
