How does data.table sort NA values on key columns?

Using data.table, say I set the key on two columns, and one of the columns has missing values. data.table seems to sort the NA values first.

require(data.table)
set.seed(919)

# Create sample data
dt <- data.table(
  key1 = rep(1:10, each = 10),
  key2 = rep_len(letters, 100)
  )

# Set some key2 values to missing
dt[sample(1:100, 10), "key2"] <- NA

# Set key (sort)
setkeyv(dt, c("key1", "key2"))
dt
#      key1 key2
#   1:    1   NA
#   2:    1    a
#   3:    1    b
#   4:    1    c
#   5:    1    d
#   6:    1    f
#   7:    1    g
#   8:    1    h
#   9:    1    i
#  10:    1    j
#  11:    2   NA
#  12:    2   NA
#  13:    2    k
#  14:    2    m
#  15:    2    n
#  16:    2    o
#  17:    2    p
#  18:    2    q
#  19:    2    r
#  20:    2    s
#  21:    3    a
#  22:    3    b
#  23:    3    c
#  24:    3    d
#  25:    3    u
#  26:    3    v
#  27:    3    w
#  28:    3    x
#  29:    3    y
#  30:    3    z
#  31:    4    e
#  32:    4    f
#  33:    4    g
#  34:    4    h
#  35:    4    i
#  36:    4    j
#  37:    4    k
#  38:    4    l
#  39:    4    m
#  40:    4    n
#  41:    5   NA
#  42:    5   NA
#  43:    5    o
#  44:    5    q
#  45:    5    r
#  46:    5    s
#  47:    5    u
#  48:    5    v
#  49:    5    w
#  50:    5    x
#  51:    6   NA
#  52:    6    a
#  53:    6    b
#  54:    6    c
#  55:    6    d
#  56:    6    e
#  57:    6    g
#  58:    6    h
#  59:    6    y
#  60:    6    z
#  61:    7    i
#  62:    7    j
#  63:    7    k
#  64:    7    l
#  65:    7    m
#  66:    7    n
#  67:    7    o
#  68:    7    p
#  69:    7    q
#  70:    7    r
#  71:    8   NA
#  72:    8   NA
#  73:    8    a
#  74:    8    b
#  75:    8    t
#  76:    8    u
#  77:    8    w
#  78:    8    x
#  79:    8    y
#  80:    8    z
#  81:    9   NA
#  82:    9    c
#  83:    9    d
#  84:    9    e
#  85:    9    f
#  86:    9    h
#  87:    9    i
#  88:    9    j
#  89:    9    k
#  90:    9    l
#  91:   10   NA
#  92:   10    m
#  93:   10    n
#  94:   10    o
#  95:   10    p
#  96:   10    r
#  97:   10    s
#  98:   10    t
#  99:   10    u
# 100:   10    v

Does this always happen, or will I run into problems if I assume it is always true?

For setkey(), data.table behaves like base R's sort(x, na.last=FALSE), because a consistent sort order (always increasing) is essential for binary-search-based joins and subsets. The rationale for NAs appearing first is that:

"NAs are internally large negative number[s]" (github.com/Rdatatable/data.table/issues/434)

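As a quick check of the behaviour described above, here is a minimal sketch on a small made-up table (the column names `id` and `grp` and their values are purely illustrative). It compares the keyed order with base R's `sort(x, na.last = FALSE)` for an integer column and then keys on a character column:

```r
library(data.table)

# Hypothetical toy data with missing values in both columns
dt <- data.table(id = c(3L, NA, 1L, 2L), grp = c("b", NA, "a", NA))

# Base R ordering with NAs placed first
sorted_base <- sort(dt$id, na.last = FALSE)

# Key on the integer column: NA comes first, then increasing values
setkey(dt, id)
key_order_id <- dt$id
identical(key_order_id, sorted_base)

# Key on the character column: NAs again come first
setkey(dt, grp)
key_order_grp <- dt$grp
key_order_grp
```

Both keyed columns should start with their NA entries, followed by the non-missing values in increasing order.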

Miscellaneous comments: If you are just looking to reorder your data, consider setorder(), which can sort in either direction and place NAs at the beginning or the end.
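For illustration, a small sketch of `setorder()` with its `na.last` argument, again on made-up data (the values below are not from the question). Note that `setorder()` only reorders the rows; it does not set a key:

```r
library(data.table)

# Hypothetical toy data mirroring the question's two columns
dt <- data.table(key1 = c(1L, 1L, 1L, 2L, 2L),
                 key2 = c("b", NA, "a", NA, "c"))

# Default (na.last = FALSE): NAs first within each key1 group, as with setkey()
setorder(dt, key1, key2)
nas_first <- dt$key2

# na.last = TRUE: NAs moved to the end of each key1 group
setorder(dt, key1, key2, na.last = TRUE)
nas_last <- dt$key2

nas_first
nas_last
```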

By the way, the standard syntax there is dt[sample(1:100, 10), key2 := NA], and you should watch out for mistaking the two-character string "NA" for a missing NA (not a problem in your example).
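To make the "NA" versus NA distinction concrete, a short sketch on hypothetical data: one row gets a real missing value via `:=`, another gets the literal string "NA". Only the former is treated as missing, and the string sorts as ordinary text:

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(key1 = 1:4, key2 = c("a", "b", "c", "d"))

dt[2, key2 := NA]     # a real missing value (stored as NA_character_)
dt[3, key2 := "NA"]   # the two-character string "NA", which is not missing

# Only one entry is actually missing
n_missing <- sum(is.na(dt$key2))

# After keying, the real NA sorts first; the string "NA" sorts as text
setkey(dt, key2)
keyed <- dt$key2
keyed
```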
