Fastest way to replace NAs in a large data.table

I have a large data.table with many missing values scattered throughout its ~200k rows and 200 columns. I would like to recode those NA values to zeros as efficiently as possible.

I see two options:
1: Convert to a data.frame, and use something like this
2: Some kind of cool data.table subsetting command

I'll be happy with a fairly efficient solution of type 1. Converting to a data.frame and then back to a data.table won't take too long.

Here's a solution using data.table's := operator, building on Andrie and Ramnath's answers.

require(data.table)  # v1.6.6
require(gdata)       # v2.8.2

set.seed(1)
dt1 = create_dt(2e5, 200, 0.1)
dim(dt1)
[1] 200000    200    # more columns than Ramnath's answer which had 5 not 200

f_andrie = function(dt) remove_na(dt)

f_gdata = function(dt, un = 0) gdata::NAToUnknown(dt, un)

f_dowle = function(dt) {     # see EDIT later for more elegant solution
  na.replace = function(v,value=0) { v[is.na(v)] = value; v }
  for (i in names(dt))
    eval(parse(text=paste("dt[,",i,":=na.replace(",i,")]")))
}

system.time(a_gdata <- f_gdata(dt1))
   user  system elapsed 
 18.805  12.301 134.985 

system.time(a_andrie <- f_andrie(dt1))
Error: cannot allocate vector of size 305.2 Mb
Timing stopped at: 14.541 7.764 68.285 

system.time(f_dowle(dt1))
  user  system elapsed 
 7.452   4.144  19.590     # EDIT has faster than this

identical(a_gdata, dt1)   
[1] TRUE

Note that f_dowle updated dt1 by reference. If a local copy is required, then an explicit call to the copy function is needed to make a local copy of the whole dataset. data.table's setkey, key<- and := do not copy-on-write.
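To make the by-reference behaviour concrete, here is a minimal sketch on a tiny table. The names DT, alias and snapshot are made up for illustration; only copy() and := come from data.table itself.

```r
library(data.table)

DT <- data.table(a = c(1, NA, 3))
alias    <- DT        # plain assignment: both names refer to the same table
snapshot <- copy(DT)  # explicit deep copy, independent of DT

DT[is.na(a), a := 0]  # updates column a in place, with no copy-on-write

anyNA(alias$a)        # FALSE: alias saw the update
anyNA(snapshot$a)     # TRUE: the copy still holds the NA
```

So if the original data must survive, take the copy() before calling any of the by-reference functions in this answer.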

Next, let's see where f_dowle is spending its time.

Rprof()
f_dowle(dt1)
Rprof(NULL)
summaryRprof()
$by.self
                  self.time self.pct total.time total.pct
"na.replace"           5.10    49.71       6.62     64.52
"[.data.table"         2.48    24.17       9.86     96.10
"is.na"                1.52    14.81       1.52     14.81
"gc"                   0.22     2.14       0.22      2.14
"unique"               0.14     1.36       0.16      1.56
... snip ...

There, I would focus on na.replace and is.na, where there are a few vector copies and vector scans. Those can fairly easily be eliminated by writing a small na.replace C function that updates NA by reference in the vector. That would at least halve the 20 seconds, I think. Does such a function exist in any R package?

The reason f_andrie fails may be that it copies the whole of dt1, or creates a logical matrix as big as the whole of dt1, a few times. The other 2 methods work on one column at a time (although I only briefly looked at NAToUnknown).

EDIT (more elegant solution as requested by Ramnath in comments):

f_dowle2 = function(DT) {
  for (i in names(DT))
    DT[is.na(get(i)), (i):=0]
}

system.time(f_dowle2(dt1))
  user  system elapsed 
 6.468   0.760   7.250   # faster, too

identical(a_gdata, dt1)   
[1] TRUE

I wish I had done it that way to start with!

EDIT2 (over 1 year later, now)

There is also set(). This can be faster if there are a lot of columns being looped through, as it avoids the (small) overhead of calling [,:=,] in a loop. set is a loopable :=. See ?set.

f_dowle3 = function(DT) {
  # either of the following for loops

  # by name :
  for (j in names(DT))
    set(DT,which(is.na(DT[[j]])),j,0)

  # or by number (slightly faster than by name) :
  for (j in seq_len(ncol(DT)))
    set(DT,which(is.na(DT[[j]])),j,0)
}
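As a small usage sketch of the same set() loop, the variant below restricts the replacement to numeric columns, so that a character column is not assigned a numeric 0. The table and the name num_cols are hypothetical, and this filtering is my own addition rather than part of the answer above.

```r
library(data.table)

DT <- data.table(x = c(1, NA, 3),
                 y = c(NA, 2.5, NA),
                 label = c("a", NA, "c"))

# Pick the numeric columns only; label is left untouched.
num_cols <- names(DT)[vapply(DT, is.numeric, logical(1))]

for (j in num_cols)
  set(DT, which(is.na(DT[[j]])), j, 0)  # same call as in f_dowle3

DT
```

The NA in label is still there afterwards; filling it would need a value of the matching type, such as a string.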

Here's the simplest one I could come up with:

dt[is.na(dt)] <- 0

It's efficient, and there is no need to write functions or other glue code.

Dedicated functions (nafill and setnafill) for that purpose are available in the data.table package (version >= 1.12.4):

They process columns in parallel, so they do well on the previously posted benchmarks. Below are the timings against the fastest approach so far, and also scaled up, using a 40-core machine.

library(data.table)
create_dt <- function(nrow=5, ncol=5, propNA = 0.5){
  v <- runif(nrow * ncol)
  v[sample(seq_len(nrow*ncol), propNA * nrow*ncol)] <- NA
  data.table(matrix(v, ncol=ncol))
}
f_dowle3 = function(DT) {
  for (j in seq_len(ncol(DT)))
    set(DT,which(is.na(DT[[j]])),j,0)
}

set.seed(1)
dt1 = create_dt(2e5, 200, 0.1)
dim(dt1)
#[1] 200000    200
dt2 = copy(dt1)
system.time(f_dowle3(dt1))
#   user  system elapsed 
#  0.193   0.062   0.254 
system.time(setnafill(dt2, fill=0))
#   user  system elapsed 
#  0.633   0.000   0.020   ## setDTthreads(1) elapsed: 0.149
all.equal(dt1, dt2)
#[1] TRUE

set.seed(1)
dt1 = create_dt(2e7, 200, 0.1)
dim(dt1)
#[1] 20000000    200
dt2 = copy(dt1)
system.time(f_dowle3(dt1))
#   user  system elapsed 
# 22.997  18.179  41.496
system.time(setnafill(dt2, fill=0))
#   user  system elapsed 
# 39.604  36.805   3.798 
all.equal(dt1, dt2)
#[1] TRUE
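
The benchmark above only exercises setnafill. As a minimal sketch of how the two functions differ (the small inputs here are made up for illustration): nafill returns a filled copy, while setnafill updates a data.table in place and can be limited to selected columns through its cols argument.

```r
library(data.table)

# nafill: returns a new vector, the input is left unchanged
x <- c(1, NA, 3)
filled <- nafill(x, fill = 0)   # type defaults to "const"

# setnafill: modifies the table by reference; here only column a
DT <- data.table(a = c(NA, 2), b = c(5, NA))
setnafill(DT, fill = 0, cols = "a")

filled
DT
```

In the version cited above these functions are meant for numeric columns, so a table with character columns would need the cols argument or a different approach.
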
library(data.table)

DT = data.table(a=c(1,"A",NA),b=c(4,NA,"B"))

DT
    a  b
1:  1  4
2:  A NA
3: NA  B

DT[,lapply(.SD,function(x){ifelse(is.na(x),0,x)})]
   a b
1: 1 4
2: A 0
3: 0 B

Just for reference: this is slower compared to gdata or data.matrix, but uses only the data.table package and can deal with non-numerical entries.

Here is a solution using NAToUnknown in the gdata package. I have used Andrie's solution to create a huge data table and have also included time comparisons with Andrie's solution.

# CREATE DATA TABLE
dt1 = create_dt(2e5, 200, 0.1)

# FUNCTIONS TO SET NA TO ZERO   
f_gdata  = function(dt, un = 0) gdata::NAToUnknown(dt, un)
f_Andrie = function(dt) remove_na(dt)

# COMPARE SOLUTIONS AND TIMES
system.time(a_gdata  <- f_gdata(dt1))

user  system elapsed 
4.224   2.962   7.388 

system.time(a_andrie <- f_Andrie(dt1))

 user  system elapsed 
4.635   4.730  20.060 

identical(a_gdata, a_andrie)

TRUE

My understanding is that the secret to fast operations in R is to utilise vectors (or arrays, which are vectors under the hood).

In this solution I make use of a data.matrix, which is an array but behaves a bit like a data.frame. Because it is an array, you can use a very simple vector substitution to replace the NAs:

A little helper function to remove the NAs. The essence is a single line of code. I only wrap it in a function to measure execution time.

remove_na <- function(x){
  dm <- data.matrix(x)
  dm[is.na(dm)] <- 0
  data.table(dm)
}

A little helper function to create a data.table of a given size.

create_dt <- function(nrow=5, ncol=5, propNA = 0.5){
  v <- runif(nrow * ncol)
  v[sample(seq_len(nrow*ncol), propNA * nrow*ncol)] <- NA
  data.table(matrix(v, ncol=ncol))
}

Demonstration on a tiny sample:

library(data.table)
set.seed(1)
dt <- create_dt(5, 5, 0.5)

dt
            V1        V2        V3        V4        V5
[1,]        NA 0.8983897        NA 0.4976992 0.9347052
[2,] 0.3721239 0.9446753        NA 0.7176185 0.2121425
[3,] 0.5728534        NA 0.6870228 0.9919061        NA
[4,]        NA        NA        NA        NA 0.1255551
[5,] 0.2016819        NA 0.7698414        NA        NA

remove_na(dt)
            V1        V2        V3        V4        V5
[1,] 0.0000000 0.8983897 0.0000000 0.4976992 0.9347052
[2,] 0.3721239 0.9446753 0.0000000 0.7176185 0.2121425
[3,] 0.5728534 0.0000000 0.6870228 0.9919061 0.0000000
[4,] 0.0000000 0.0000000 0.0000000 0.0000000 0.1255551
[5,] 0.2016819 0.0000000 0.7698414 0.0000000 0.0000000

For the sake of completeness, another way to replace NAs with 0 is to use

f_rep <- function(dt) {
  dt[is.na(dt)] <- 0
  return(dt)
}

To compare results and times I have incorporated all approaches mentioned so far.

set.seed(1)
dt1 <- create_dt(2e5, 200, 0.1)
dt2 <- dt1
dt3 <- dt1

system.time(res1 <- f_gdata(dt1))
   user  system elapsed 
   3.62    0.22    3.84 
system.time(res2 <- f_andrie(dt1))
   user  system elapsed 
   2.95    0.33    3.28 
system.time(f_dowle2(dt2))
   user  system elapsed 
   0.78    0.00    0.78 
system.time(f_dowle3(dt3))
   user  system elapsed 
   0.17    0.00    0.17 
system.time(res3 <- f_unknown(dt1))
   user  system elapsed 
   6.71    0.84    7.55 
system.time(res4 <- f_rep(dt1))
   user  system elapsed 
   0.32    0.00    0.32 

identical(res1, res2) & identical(res2, res3) & identical(res3, res4) & identical(res4, dt2) & identical(dt2, dt3)
[1] TRUE

So the new approach is slightly slower than f_dowle3 but faster than all the other approaches. But to be honest, this is against my intuition of the data.table syntax and I have no idea why this works. Can anybody enlighten me?
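One way to look into the question above: is.na() applied to a whole data.table returns a logical matrix with the same dimensions as the table, and the `[<-` replacement form accepts such a matrix as its index, much as it does for a data.frame. The sketch below only illustrates that observable behaviour on a made-up two-column table; it makes no claim about data.table's internals.

```r
library(data.table)

dt <- data.table(a = c(1, NA), b = c(NA, 4))

m <- is.na(dt)   # a logical matrix, one cell per table cell
is.matrix(m)     # TRUE
dim(m)           # same dimensions as dt

dt[m] <- 0       # equivalent to dt[is.na(dt)] <- 0
dt
```

Note that this form goes through R's ordinary replacement mechanism rather than :=, so it should not be relied on to update the table by reference.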

Using the fifelse function from the newest data.table version (1.12.6), it is even 10 times faster than NAToUnknown in the gdata package:

z = data.table(x = sample(c(NA_integer_, 1), 2e7, TRUE))
system.time(z[,x1 := gdata::NAToUnknown(x, 0)])

#   user  system elapsed 
#  0.798   0.323   1.173 
system.time(z[,x2:= fifelse(is.na(x), 0, x)])

#   user  system elapsed 
#  0.172   0.093   0.113 

To generalize to many columns you could use this approach (using the previous sample data but adding a column):

z = data.table(x = sample(c(NA_integer_, 1), 2e7, TRUE), y = sample(c(NA_integer_, 1), 2e7, TRUE))

z[, names(z) := lapply(.SD, function(x) fifelse(is.na(x), 0, x))]

I didn't test the speed, though.
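For a quick correctness check of the multi-column form, here is a minimal sketch on a small made-up table. It uses double columns on purpose: fifelse expects its yes and no arguments to have the same type, so with integer columns the replacement would likely need to be 0L instead of 0.

```r
library(data.table)

z <- data.table(x = c(NA, 1, 2), y = c(3, NA, NA))
cols <- names(z)

# Replace NA with 0 in every listed column, updating z by reference
z[, (cols) := lapply(.SD, function(v) fifelse(is.na(v), 0, v)),
  .SDcols = cols]

z
```

Passing .SDcols makes it easy to narrow cols later, for example to the numeric columns only.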

> DT = data.table(a=LETTERS[c(1,1:3,4:7)],b=sample(c(15,51,NA,12,21),8,T),key="a")
> DT
   a  b
1: A 12
2: A NA
3: B 15
4: C NA
5: D 51
6: E NA
7: F 15
8: G 51
> DT[is.na(b),b:=0]
> DT
   a  b
1: A 12
2: A  0
3: B 15
4: C  0
5: D 51
6: E  0
7: F 15
8: G 51
> 
