In R using data.table, how does one exclude rows and how does one include NA values in an integer column?
I use data.table quite a lot. It works well, but it is taking me a long time to adapt my syntax so that it takes advantage of binary search.
In the following data table, how would one select all the rows, including those where the CPT value is NA, but exclude rows where the CPT value is 23456 or 10000?
library(data.table)
cpt <- c(23456,23456,10000,44555,44555,NA)
description <- c("tonsillectomy","tonsillectomy in >12 year old","brain transplant","castration","orchidectomy","miscellaneous procedure")
cpt.desc <- data.table(cpt,description)
setkey(cpt.desc,cpt)
The following line works, but I think it uses a vector scan instead of a binary search (or binary exclusion). Is there a way to drop rows by binary methods?
cpt.desc[!cpt %in% c(23456,10000),]
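For reference, a minimal sketch of why the line above keeps the NA row: `%in%` never returns NA, so `NA %in% c(23456, 10000)` is FALSE and its negation is TRUE. A comparison such as `cpt != 23456`, by contrast, yields NA for the missing value, and data.table drops rows where `i` evaluates to NA. The names `keep` and `drop_na` are illustrative only and do not appear in the question.

```r
library(data.table)

cpt <- c(23456, 23456, 10000, 44555, 44555, NA)
description <- c("tonsillectomy", "tonsillectomy in >12 year old",
                 "brain transplant", "castration", "orchidectomy",
                 "miscellaneous procedure")
cpt.desc <- data.table(cpt, description)

# %in% treats NA as "not in the set", so the NA row survives the negation
keep <- cpt.desc[!cpt %in% c(23456, 10000)]

# != propagates NA, and rows where the condition is NA are dropped
drop_na <- cpt.desc[cpt != 23456 & cpt != 10000]
```

So the `%in%` form already answers the "include NA" half of the question; only the vector scan remains a concern.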
Only a partial answer, because I am new to data.table. A self-join works for numbers, but the same fails for strings. I am sure one of the professional data tablers knows what to do.
library(data.table)
n <- 1000000
cpt.desc <- data.table(
cpt=rep(c(23456,23456,10000,44555,44555,NA),n),
description=rep(c("tonsillectomy","tonsillectomy in >12 year old","brain transplant","castration","orchidectomy","miscellaneous procedure"),n))
# Added on revision. Not very elegant, though. Faster by a factor of 3,
# and it probably scales better
setkey(cpt.desc,cpt)
system.time(a<-cpt.desc[-cpt.desc[J(23456,45555),which=TRUE]])
system.time(b<-cpt.desc[!(cpt %in% c(23456,45555))] )
str(a)
str(b)
identical(as.data.frame(a),as.data.frame(b))
# A self-join works OK with numbers
setkey(cpt.desc,cpt)
system.time(a<-cpt.desc[cpt %in% c(23456,45555),])
system.time(b<-cpt.desc[J(23456,45555)])
str(a)
str(b)
identical(as.data.frame(a),as.data.frame(b)[,-3])
# But the same fails with characters
# (note: J("a","b") builds one row with two columns, not two lookup values)
setkey(cpt.desc,description)
system.time(a<-cpt.desc[description %in% c("castration","orchidectomy"),])
system.time(b<-cpt.desc[J("castration","orchidectomy"),])
identical(as.data.frame(a),as.data.frame(b)[,-3])
str(a)
str(b)
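A hedged sketch of the binary-search route, assuming a data.table version that supports the not-join form `X[!i]`. It relies on one detail: `J(c(a, b))` builds a one-column lookup with two rows, whereas `J(a, b)` builds a single row with two columns, which may explain the mismatch seen on the character key above. The names `small`, `excl` and `incl` are illustrative, and the exclusion values are the ones from the question (23456 and 10000).

```r
library(data.table)

small <- data.table(
  cpt = c(23456, 23456, 10000, 44555, 44555, NA),
  description = c("tonsillectomy", "tonsillectomy in >12 year old",
                  "brain transplant", "castration", "orchidectomy",
                  "miscellaneous procedure"))

# Numeric key: exclude two codes with a not-join; the unmatched NA row is kept
setkey(small, cpt)
excl <- small[!J(c(23456, 10000))]

# Character key: one lookup column holding both values
setkey(small, description)
incl <- small[J(c("castration", "orchidectomy"))]
```

The not-join also avoids the `-which` idiom, which can misbehave when a lookup value has no match and the row index comes back as NA.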