Selecting NA in a data.table in R

Question

How do I select all the rows that have a missing value in the primary key of a data.table?

DT = data.table(x=rep(c("a","b",NA),each=3), y=c(1,3,6), v=1:9)
setkey(DT,x)

Selecting for a particular value is easy:

DT["a",]

Selecting for the missing values seems to require a vector scan; one cannot use binary search. Am I correct?

DT[NA,]        # does not work
DT[is.na(x),]  # does work

Answer 1

Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice this may not matter much:

library(data.table)
library(rbenchmark)

DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)  

benchmark(DT["a",],
          DT[is.na(x),],
          replications=20)
#             test replications elapsed relative user.self sys.self user.child
# 1      DT["a", ]           20    9.18    1.000      7.31     1.83         NA
# 2 DT[is.na(x), ]           20   10.55    1.149      8.69     1.85         NA

=== ===

Addition from Matthew (won't fit in a comment):

The data above has 3 very large groups, though, so the speed advantage of binary search is dominated here by the time to create the large subset (1/3 of the data is copied).

benchmark(DT["a",],  # repeat select of large subset on my netbook
    DT[is.na(x),],
    replications=3)
          test replications elapsed relative user.self sys.self
     DT["a", ]            3   2.406    1.000     2.357    0.044
DT[is.na(x), ]            3   3.876    1.611     3.812    0.056

benchmark(DT["a",which=TRUE],   # isolate search time
    DT[is.na(x),which=TRUE],
    replications=3)
                      test replications elapsed relative user.self sys.self
     DT["a", which = TRUE]            3   0.492    1.000     0.492    0.000
DT[is.na(x), which = TRUE]            3   2.941    5.978     2.932    0.004

As the size of the subset returned decreases (e.g. by adding more groups), the difference becomes apparent. Vector scans on a single column aren't too bad, but on 2 or more columns performance quickly degrades.
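To make the two-column case concrete, here is a minimal sketch on made-up data (the table DT2 and its columns g1, g2 and v are hypothetical, not from the benchmark above). It only checks that the keyed join and the logical-vector subset pick the same rows; wrapping each call in system.time() or benchmark() would show the timing gap on a given machine. Note that newer data.table releases may optimise some `==` subsets internally, so the second form is not guaranteed to be a pure vector scan there.

```r
library(data.table)

set.seed(1)
n <- 1e5

# Hypothetical table with a two-column key (names are illustrative only)
DT2 <- data.table(g1 = sample(letters, n, replace = TRUE),
                  g2 = sample(1:100, n, replace = TRUE),
                  v  = seq_len(n))
setkey(DT2, g1, g2)

# Binary search on both key columns; which = TRUE returns row numbers only
idx_join <- DT2[J("a", 50L), which = TRUE]

# The same selection written as a logical expression over two columns
idx_scan <- DT2[g1 == "a" & g2 == 50L, which = TRUE]

# Both approaches should select exactly the same rows
stopifnot(identical(idx_join, idx_scan))
```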

Maybe NAs should be joinable too. I seem to remember a gotcha with that, though. There is some history linked from FR#1043, "Allow or disallow NA in keys?". It mentions that NA_integer_ is internally a negative integer. That trips up radix/counting sort (iirc), resulting in setkey going slower. But it's on the list to revisit.

Answer 2

This is now implemented in v1.8.11. From NEWS:

o Binary search is now capable of subsetting NA/NaNs and can also perform joins and merges by matching NAs/NaNs.

You do, however, have to provide the correctly typed NA (NA_real_, NA_character_, etc.) explicitly at the moment.

On OP's data:

DT[J(NA_character_)] # or for characters simply DT[NA_character_]
#     x y v
# 1: NA 1 7
# 2: NA 3 8
# 3: NA 6 9
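As a rough illustration of the "correct NA type" point, the sketch below rebuilds the OP's table and adds a second, hypothetical table DN keyed on a numeric column (DN and its columns k and v are not from the original post). It assumes data.table v1.8.11 or later, where NA keys can be matched by binary search.

```r
library(data.table)

# OP's table, with y written out in full rather than recycled
DT <- data.table(x = rep(c("a", "b", NA), each = 3),
                 y = rep(c(1, 3, 6), 3),
                 v = 1:9)
setkey(DT, x)

# Character key: join on NA_character_ and compare with the vector scan
by_join <- DT[J(NA_character_)]
by_scan <- DT[is.na(x)]
stopifnot(identical(by_join$v, by_scan$v))

# Hypothetical numeric key: the NA passed to J() must be NA_real_ here
DN <- data.table(k = c(1.5, NA, 2.5, NA), v = 1:4)
setkey(DN, k)
na_rows <- DN[J(NA_real_)]
stopifnot(nrow(na_rows) == 2L, all(is.na(na_rows$k)))
```

A plain NA is logical, which is why the typed constants are used in both calls.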

Also, here's the same benchmark from @JoshOBrien's post, with this binary search for NA added:

library(data.table)
library(rbenchmark)

DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)  

benchmark(DT["a",],
          DT[is.na(x),],
          DT[NA_character_], 
          replications=20)

            test replications elapsed relative user.self sys.self
1      DT["a", ]           20   4.763    1.238     4.000    0.567
2 DT[is.na(x), ]           20   5.399    1.403     4.537    0.794
3         DT[NA]           20   3.847    1.000     3.215    0.600 # <~~~


Answer 1
22 accepted 2012-09-28 20:01:03

Answer 2
19 2013-12-15 01:06:41
