How does data.table sort NA values on key columns?
Using data.table, say I'm setting the key on two columns, and one of the columns has missing values. data.table seems to sort the NA values first within each group.
require(data.table)
set.seed(919)
# Create sample data
dt <- data.table(
  key1 = rep(1:10, each = 10),
  key2 = rep_len(letters, 100)
)
# Set some key2 values to missing
dt[sample(1:100, 10), "key2"] <- NA
# Set key (sort)
setkeyv(dt, c("key1", "key2"))
dt
# key1 key2
# 1: 1 NA
# 2: 1 a
# 3: 1 b
# 4: 1 c
# 5: 1 d
# 6: 1 f
# 7: 1 g
# 8: 1 h
# 9: 1 i
# 10: 1 j
# 11: 2 NA
# 12: 2 NA
# 13: 2 k
# 14: 2 m
# 15: 2 n
# 16: 2 o
# 17: 2 p
# 18: 2 q
# 19: 2 r
# 20: 2 s
# 21: 3 a
# 22: 3 b
# 23: 3 c
# 24: 3 d
# 25: 3 u
# 26: 3 v
# 27: 3 w
# 28: 3 x
# 29: 3 y
# 30: 3 z
# 31: 4 e
# 32: 4 f
# 33: 4 g
# 34: 4 h
# 35: 4 i
# 36: 4 j
# 37: 4 k
# 38: 4 l
# 39: 4 m
# 40: 4 n
# 41: 5 NA
# 42: 5 NA
# 43: 5 o
# 44: 5 q
# 45: 5 r
# 46: 5 s
# 47: 5 u
# 48: 5 v
# 49: 5 w
# 50: 5 x
# 51: 6 NA
# 52: 6 a
# 53: 6 b
# 54: 6 c
# 55: 6 d
# 56: 6 e
# 57: 6 g
# 58: 6 h
# 59: 6 y
# 60: 6 z
# 61: 7 i
# 62: 7 j
# 63: 7 k
# 64: 7 l
# 65: 7 m
# 66: 7 n
# 67: 7 o
# 68: 7 p
# 69: 7 q
# 70: 7 r
# 71: 8 NA
# 72: 8 NA
# 73: 8 a
# 74: 8 b
# 75: 8 t
# 76: 8 u
# 77: 8 w
# 78: 8 x
# 79: 8 y
# 80: 8 z
# 81: 9 NA
# 82: 9 c
# 83: 9 d
# 84: 9 e
# 85: 9 f
# 86: 9 h
# 87: 9 i
# 88: 9 j
# 89: 9 k
# 90: 9 l
# 91: 10 NA
# 92: 10 m
# 93: 10 n
# 94: 10 o
# 95: 10 p
# 96: 10 r
# 97: 10 s
# 98: 10 t
# 99: 10 u
# 100: 10 v
# key1 key2
Does this always happen, or will I run into problems if I assume it is always true?
For setkey(), data.table behaves like base R sort(x, na.last=FALSE), because a sort order that is always increasing is essential for binary-search-based joins and subsets. The rationale for NAs appearing first is:
"NAs are internally large negative number[s]" github.com/Rdatatable/data.table/issues/434
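As a quick check of the claim above, here is a minimal sketch comparing setkey() with base R sort(x, na.last=FALSE). The table dt_small and its columns k and id are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Hypothetical toy data: an integer key column containing missing values
x <- c(3L, NA, 1L, NA, 2L)
dt_small <- data.table(k = x, id = seq_along(x))

# setkey() sorts the table by reference, in increasing order
setkey(dt_small, k)

# Base R ordering that the answer says setkey() mirrors
base_sorted <- sort(x, na.last = FALSE)

print(dt_small$k)
print(base_sorted)
```

Both printed vectors should start with the two NA values, followed by the non-missing keys in increasing order.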
Miscellaneous comments: If you are just looking to reorder your data, consider setorder(), which can sort in increasing or decreasing order and can place NAs at either the beginning or the end.
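A short sketch of setorder() with its na.last argument may help here. The table dt_ord and the variable names asc and desc are hypothetical; only setorder(), its na.last argument, and the minus prefix for decreasing order come from data.table itself.

```r
library(data.table)

# Hypothetical toy data: a character column with one missing value
dt_ord <- data.table(k = c("b", NA, "a", "c"), v = 1:4)

# Increasing order with NAs placed at the end
setorder(dt_ord, k, na.last = TRUE)
asc <- copy(dt_ord$k)   # copy, because setorder() reorders columns by reference

# Decreasing order with NAs placed at the beginning
setorder(dt_ord, -k, na.last = FALSE)
desc <- copy(dt_ord$k)

print(asc)
print(desc)
```

Unlike setkey(), setorder() only reorders the rows and does not mark the table as keyed, so it is the lighter choice when no keyed join or subset is needed.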
By the way, the standard data.table syntax for that assignment is dt[sample(1:100, 10), key2 := NA], and you should watch out for mistaking the two-character string "NA" for a real NA (not a problem in your example).
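To illustrate both points, the sketch below assigns a missing value with := and shows that the string "NA" is treated as ordinary text when the key is set. The table dt_na and its contents are invented for this example.

```r
library(data.table)

# Hypothetical toy data: note that "NA" here is a two-character string
dt_na <- data.table(key2 = c("a", "b", "NA", "c"))

# Assign a real missing value by reference with :=
dt_na[2L, key2 := NA]

# Only the real NA is reported as missing; the string "NA" is not
missing_flags <- is.na(dt_na$key2)
print(missing_flags)

# After keying, the real NA sorts first and "NA" sorts with the other strings
setkey(dt_na, key2)
print(dt_na$key2)
```

A check such as dt[key2 == "NA"] on real data is a cheap way to spot strings that were meant to be missing values.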