简体   繁体   中英

R data.table replace “NULL” with `NA` when columns are factors

I pulled some data from a SQL database through ODBC and the columns are automatically set to factor . It is something like the following:

library(RODBC)
library(data.table)
data <- data.table(sqlQuery(channel, query))

My data looks like this, just with a lot more columns:

data <- data.table("C1"=as.factor(c(letters[1:4], "NULL", letters[5])),
                   "C2"=as.factor(c(rnorm(3), "NULL", rnorm(2))),
                   "C3"=as.factor(c(letters[1], "NULL", letters[2:4], "NULL")))
> data
     C1                 C2   C3
1:    a -0.190200079604691    a
2:    b  0.310548914832963 NULL
3:    c 0.0153099116493453    b
4:    d               NULL    c
5: NULL  0.157187027626419    d
6:    e  0.118537540781528 NULL
> str(data)
Classes ‘data.table’ and 'data.frame':  6 obs. of  3 variables:
 $ C1: Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 6 5
 $ C2: Factor w/ 6 levels "-0.190200079604691",..: 1 5 2 6 4 3
 $ C3: Factor w/ 5 levels "a","b","c","d",..: 1 5 2 3 4 5
 - attr(*, ".internal.selfref")=<externalptr> 

How can I replace the "NULL" with NA ? Here I want R to treat these SQL "NULL" strings as missing values NA . I tried the following, but seems NA causes problems.

for (col in names(data)) {
  set(data, which(data[[col]]=="NULL"), col, NA)
}

> Error in set(data, which(data[[col]] == "NULL"), col, NA) : 
  Can't assign to column 'C1' (type 'factor') a value of type 'logical' (not character, factor, integer or numeric)

RODBC Solution

Thanks to @user20650's suggestion, you can control missing values from sqlQuery by doing data <- data.table(sqlQuery(channel, query, na.strings=c("NA", "NULL"))) . However, it is still possible to have this problem if your data source is formatted incorrectly, so this is not an universal solution to the post.

This has the desired effect and is much more compact:

is.na(data) <- data == "NULL"

Note re: comment Q: the is.na function is quite different than the is.na<- function. The latter one used here is an assignment of a value of NA to items that are defined by the logical expression on hte RHS of the assigment operator. There is an is.na.data.frame -method but not an is.na[<-.dataframe -method. So not really sure if this is a purely by-reference strategy, since it is not implemented with the [.data.frame syntax. It may be using "is.na<-.default".

I think after noodling around a bit more that "is.na<-.default" (which is just {x[value] <- NA; x} ) so will end up dispatching this call to [<-.data.table so it probably will be done "by reference".

Here's one way:

data[,names(data):=lapply(.SD,function(x){
  z <- levels(x)
  z[z=="NULL"] <- NA
  `levels<-`(x,z)
})]

To see what happened, look at lapply(data,levels) and you'll see that "NULL" is gone.


(Thanks, @akrun:) A more concise and intuitive variant is available using the car package:

library(car)
data[,names(data):=lapply(.SD, recode, '"NULL"=NA')]

In the data.table world, it's usually possible to modify by reference. In this case, that looks like...

for (j in names(data)) setattr(data[[j]],"levels",{
  z <- levels(data[[j]])
  z[z=="NULL"] <- NA
  z
})

This avoids making a copy of the whole vector, as `levels<-` does.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM