简体   繁体   中英

Fast way to replace all blanks with NA in R data.table

I have a large data.table object (1M rows and 220 columns) and I want to replace all blanks ('') with NA. I found a solution in this Post , but it's extremely slow for my data table (takes already over 15mins) Example from the other post:

 data = data.frame(cats=rep(c('', ' ', 'meow'),1e6),
                   dogs=rep(c("woof", " ", NA),1e6))
 system.time(x<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))

Is there a more data.table fast way to achieve this?

Indeed the provided data does not look much like the original data, it was just to give an example. The following subset of my real data gives the CharToDate(x) error:

DT <- data.table(ID=c(10),DEFAULT_DATE=as.Date("2012-07-31"),value='')
system.time(DT[DT=='']<-NA)

Here's probably the generic data.table way of doing this. I'm also going to use your regex which handles several types of blanks (I havn't seen other answers doing this). You probably shouldn't run this over all your columns rather only over the factor or character ones, because other classes won't accept blank values.

For factor s

indx <- which(sapply(data, is.factor))
for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_) 

For character s

indx2 <- which(sapply(data, is.character)) 
for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_)

Use this approach:

system.time(data[data==''|data==' ']<-NA)
  user  system elapsed 
  1.47    0.19    1.66 

system.time(y<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))
  user  system elapsed 
  3.41    0.20    3.64

Assuming you had mistake while populating your data, below is the solution using data.table which you used in tag.

library(data.table)
data = data.table(cats=rep(c('', ' ', 'meow'),1000000),dogs=rep(c("woof", " ", NA),1000000))
system.time(data[cats=='', cats := NA][dogs=='', dogs := NA])
#  user  system elapsed 
# 0.056   0.000   0.059 

If you have a lot of column see David's comment.

在尝试了几种不同的方法后,我发现最快和最简单的选择是:

data[data==""] <- NA

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM