
Fast way to replace all blanks with NA in R data.table

I have a large data.table object (1M rows and 220 columns) and I want to replace all blanks ('') with NA. I found a solution in this post, but it is extremely slow on my data table (it has already been running for over 15 minutes). Example from the other post:

 data = data.frame(cats=rep(c('', ' ', 'meow'),1e6),
                   dogs=rep(c("woof", " ", NA),1e6))
 system.time(x<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))

Is there a faster, more data.table-idiomatic way to achieve this?

Indeed, the provided data does not look much like my original data; it was only meant as an example. The following subset of my real data gives the charToDate(x) error:

DT <- data.table(ID=c(10),DEFAULT_DATE=as.Date("2012-07-31"),value='')
system.time(DT[DT=='']<-NA)

Here's probably the generic data.table way of doing this. I'm also going to use your regex, which handles several types of blanks (I haven't seen other answers doing this). You probably shouldn't run this over all your columns, but rather only over the factor or character ones, because other classes won't accept blank values.

For factors

indx <- which(sapply(data, is.factor))
for (j in indx) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_integer_) 

For characters

indx2 <- which(sapply(data, is.character)) 
for (j in indx2) set(data, i = grep("^$|^ $", data[[j]]), j = j, value = NA_character_)
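The two loops above can be folded into a single pass over the columns. The following is a minimal sketch, assuming data.table is installed; the helper name blank_to_na and the sample table are hypothetical and only serve as illustration. Columns of any other class (such as the Date column from the question) are skipped, which avoids the charToDate(x) error.

```r
library(data.table)

# Hypothetical helper (not part of data.table): replaces blanks matched by
# `pattern` with NA in every character or factor column, modifying `dt` by
# reference. Columns of other classes are left untouched.
blank_to_na <- function(dt, pattern = "^$|^ $") {
  for (j in seq_along(dt)) {
    col <- dt[[j]]
    if (!is.character(col) && !is.factor(col)) next
    idx <- grep(pattern, col)   # row numbers holding "" or " "
    if (length(idx) == 0L) next # nothing to replace in this column
    na_value <- if (is.factor(col)) NA_integer_ else NA_character_
    set(dt, i = idx, j = j, value = na_value)
  }
  invisible(dt)
}

# Illustrative data modelled on the question's subset.
DT <- data.table(
  ID           = c(10, 11, 12),
  DEFAULT_DATE = as.Date("2012-07-31") + 0:2,
  value        = c("", " ", "x"),
  pet          = factor(c("woof", " ", ""))
)

blank_to_na(DT)
print(DT)
```

Because set() assigns by reference, the table is not copied, which is the main reason this scales better than apply() on a wide table.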

Use this approach:

system.time(data[data==''|data==' ']<-NA)
  user  system elapsed 
  1.47    0.19    1.66 

system.time(y<-apply(data, 2, function(x) gsub("^$|^ $", NA, x)))
  user  system elapsed 
  3.41    0.20    3.64

Assuming you made a mistake while populating your data, below is a solution using data.table, which you used in the tags.

library(data.table)
data = data.table(cats=rep(c('', ' ', 'meow'),1000000),dogs=rep(c("woof", " ", NA),1000000))
system.time(data[cats=='', cats := NA][dogs=='', dogs := NA])
#  user  system elapsed 
# 0.056   0.000   0.059 

If you have a lot of columns, see David's comment.
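The comment itself is not reproduced on this page. One common way to extend the chained `:=` idea to many columns, offered here as a sketch rather than as what the comment proposed, is to collect the character column names and update them all in one call with .SDcols. The sample table and the name chr_cols are illustrative assumptions.

```r
library(data.table)

# Small illustrative table; `n` stands in for a non-character column.
DT <- data.table(
  cats = rep(c("", " ", "meow"), 4),
  dogs = rep(c("woof", " ", NA), 4),
  n    = 1:12
)

# Names of the character columns, collected programmatically.
chr_cols <- names(DT)[vapply(DT, is.character, logical(1))]

# Update all of them by reference; %chin% is data.table's fast %in%
# for character vectors and treats existing NAs as non-matches.
DT[, (chr_cols) := lapply(.SD, function(x) {
     replace(x, x %chin% c("", " "), NA_character_)
   }),
   .SDcols = chr_cols]

print(DT)
```

Unlike the chained version above, this also treats a single space as blank, and it never touches numeric or Date columns.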

After trying several different methods, I found the fastest and simplest option to be:

data[data==""] <- NA

