In R using data.table, how does one exclude rows, and how does one include NA values in an integer column?

Question

I use data.table a lot. It works well, but I am finding that it takes me a long time to convert my syntax so that it takes advantage of binary search.

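In the following data table, how would one select all rows, including those where the CPT value is NA, but exclude rows where the CPT value is 23456 or 10000?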

library(data.table)

cpt <- c(23456,23456,10000,44555,44555,NA)
description <- c("tonsillectomy","tonsillectomy in >12 year old","brain transplant","castration","orchidectomy","miscellaneous procedure")
cpt.desc <- data.table(cpt,description)

setkey(cpt.desc,cpt)

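The following line works, but I think it uses the vector scan method rather than a binary search (or binary exclusion). Is there a way to drop rows using a binary method?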

cpt.desc[!cpt %in% c(23456,10000),]

Answer 1

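Only a partial answer, since I am new to data.table. A self-join works for numbers, but the same fails for strings. I am sure one of the data.table experts knows how to do it.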

library(data.table)

n <- 1000000
cpt.desc <- data.table(
  cpt=rep(c(23456,23456,10000,44555,44555,NA),n),
  description=rep(c("tonsillectomy","tonsillectomy in >12 year old","brain transplant","castration","orchidectomy","miscellaneous procedure"),n))

# Added on revision. Not very elegant, though. Faster by factor of 3
# but probably better scaling 
setkey(cpt.desc,cpt)
system.time(a<-cpt.desc[-cpt.desc[J(23456,45555),which=TRUE]])
system.time(b<-cpt.desc[!(cpt %in% c(23456,45555))] )
str(a)
str(b)

identical(as.data.frame(a),as.data.frame(b))

# A self-join works Ok with numbers
setkey(cpt.desc,cpt)
system.time(a<-cpt.desc[cpt %in% c(23456,45555),])
system.time(b<-cpt.desc[J(23456,45555)])
str(a)
str(b)

identical(as.data.frame(a),as.data.frame(b)[,-3])

# But the same fails with characters
setkey(cpt.desc,description)
system.time(a<-cpt.desc[description %in% c("castration","orchidectomy"),])
system.time(b<-cpt.desc[J("castration","orchidectomy"),])
identical(as.data.frame(a),as.data.frame(b)[,-3])

str(a)
str(b)
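
For the exclusion asked about in the question, a minimal sketch using a not-join might look like the following. It assumes a data.table version that supports `!` in front of a join in `i`, and the names `exclude`, `kept` and `scanned` are made up for illustration. It also assumes the values to exclude are passed as a single vector: `J(c(a, b))` is a one-column lookup with two rows, whereas `J(a, b)` is a one-row lookup with two columns, which may be why the string comparison above does not match.

```r
library(data.table)

# Same toy data as in the question.
cpt.desc <- data.table(
  cpt = c(23456, 23456, 10000, 44555, 44555, NA),
  description = c("tonsillectomy", "tonsillectomy in >12 year old",
                  "brain transplant", "castration", "orchidectomy",
                  "miscellaneous procedure"))
setkey(cpt.desc, cpt)

# Hypothetical name: the key values whose rows should be dropped.
exclude <- c(23456, 10000)

# Not-join: keep every row whose key does not match a value in `exclude`.
# The NA row matches nothing in `exclude`, so it is kept.
kept <- cpt.desc[!J(exclude)]

# Vector-scan version from the question, for comparison.
scanned <- cpt.desc[!cpt %in% exclude]

print(kept)
```

On this small table both forms should keep the same three rows, including the one where cpt is NA; whether the keyed form is actually faster on a large table is something to measure with system.time() as in the code above.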

Solution 1
2 Accepted 2012-01-19 09:22:51
