Count logical values in subset of columns using .SDcols argument to data.table
I have a data.table of logical values as follows:
library(data.table)
set.seed(1)
myDt <- data.table(id = paste0("id", 1:10))
myDt[, paste0(letters[1:3], sample(1:10, 9, replace = FALSE)) :=
lapply(1:9, function(i) sample(c(TRUE, FALSE), 10, replace = TRUE))]
myDt
id a3 b4 c5 a7 b2 c8 a9 b6 c10
1: id1 TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
2: id2 TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
3: id3 TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
4: id4 FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
5: id5 TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
6: id6 FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
7: id7 TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
8: id8 FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
9: id9 FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
10: id10 TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
The columns apart from id are three categories (a, b and c), each with 3 replicates (identified by an integer suffix). I need to count the logical values for each category without knowing the replicate numbers in advance.
I can get the columns for category a as follows:
aCols <- grep("^a", names(myDt), value = TRUE)
myDt[, .SD, .SDcols = aCols, by = id]
id a3 a7 a9
1: id1 TRUE TRUE FALSE
2: id2 TRUE FALSE TRUE
3: id3 TRUE FALSE FALSE
4: id4 FALSE FALSE TRUE
5: id5 TRUE FALSE TRUE
6: id6 FALSE FALSE TRUE
7: id7 TRUE FALSE FALSE
8: id8 FALSE TRUE FALSE
9: id9 FALSE TRUE TRUE
10: id10 TRUE FALSE FALSE
but then I'm stuck when trying to count the logical values. So far I've tried:
myDt[, sum(.SD), .SDcols = aCols, by = id]
Error in gsum(.SD) :
GForce sum can only be applied to columns, not .SD or similar. To sum all items in a list such as .SD, either add the prefix base::sum(.SD) or turn off GForce optimization using options(datatable.optimize=1). More likely, you may be looking for 'DT[,lappy(.SD,sum),by=,.SDcols=]'
and
myDt[, base::sum(.SD), .SDcols = aCols, by = id]
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
I did try the latter code with numerics instead of logicals and it gave me the expected result.
I'd appreciate any suggestions. Thanks for reading!
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8
[4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.4
loaded via a namespace (and not attached):
[1] magrittr_1.5 plyr_1.8.3 tools_3.2.2 reshape2_1.4.1 Rcpp_0.12.0 stringi_0.5-5
[7] stringr_1.0.0 chron_2.3-47
When you have many columns of the same type and you want to operate on them all at once, it is usually better to tidy up your data (reshape it to long format) and then spread it again. Here's a possible solution using a melt and dcast combination:
# melt by the "id" column
res <- melt(myDt, id = "id")
# Remove numeric values from column names
res[, indx := sub("\\d+", "", variable)]
# Spread the data again according to the new index while counting `TRUE`s
dcast(res, id ~ indx, value.var = "value", fun.aggregate = function(x) sum(x == "TRUE"))
# id a b c
# 1: id1 2 0 3
# 2: id10 1 1 1
# 3: id2 2 2 2
# 4: id3 1 1 2
# 5: id4 1 2 2
# 6: id5 2 3 2
# 7: id6 1 2 0
# 8: id7 1 3 1
# 9: id8 1 2 2
# 10: id9 2 2 2
I've used the development version here (v 1.9.5); you may need to use dcast.data.table instead of just dcast if you are using v 1.9.4.
Also, you mentioned you have logical values, but your example contained character values (sample(c("TRUE", "FALSE"), 10, replace = TRUE) instead of just sample(c(TRUE, FALSE), 10, replace = TRUE)). If your real data set truly has logical values, then the last step can be simplified to just
dcast(res, id ~ indx, value.var = "value", sum)
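For reference, here is a minimal self-contained sketch of the same melt/dcast idea on genuinely logical columns. The table dt and the column name category are made up for illustration and are not the question's data; the sketch assumes a data.table version where melt and dcast dispatch on data.table objects (v 1.9.5 or later).

```r
library(data.table)

# Hypothetical toy data: two categories (a, b), two replicates each, all logical
dt <- data.table(
  id = c("id1", "id2", "id3"),
  a3 = c(TRUE, TRUE, FALSE),
  a7 = c(TRUE, FALSE, FALSE),
  b4 = c(FALSE, TRUE, TRUE),
  b2 = c(FALSE, TRUE, FALSE)
)

# Long format: one row per (id, original column) pair; keep names as character
long <- melt(dt, id.vars = "id", variable.factor = FALSE)

# Strip the trailing replicate number to recover the category letter
long[, category := sub("\\d+$", "", variable)]

# Wide again, one column per category; summing logicals counts the TRUEs
counts <- dcast(long, id ~ category, value.var = "value", fun.aggregate = sum)
print(counts)
```

Because the category is derived from the column names, nothing here depends on knowing the replicate numbers in advance.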
I like @David Arenburg's answer. Just to add another option: use rowSums() instead of sum(). With your updated data, use
myDt[, a_cols := rowSums(.SD), .SDcols = aCols]
myDt
id a3 b4 c5 a7 b2 c8 a9 b6 c10 a_cols
1: id1 TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE 2
2: id2 TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE 2
3: id3 TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE 1
4: id4 FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE 1
5: id5 TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE 2
6: id6 FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE 1
7: id7 TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE 1
8: id8 FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE 1
9: id9 FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE 2
10: id10 TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE 1
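If you need this for every category rather than just a, one possible sketch is to derive the category prefixes from the column names and loop over them. The toy table dt and the _count suffix below are assumptions for illustration, not part of the original data.

```r
library(data.table)

# Hypothetical toy data: two categories (a, b), two replicates each, all logical
dt <- data.table(
  id = c("id1", "id2", "id3"),
  a3 = c(TRUE, TRUE, FALSE),
  a7 = c(TRUE, FALSE, FALSE),
  b4 = c(FALSE, TRUE, TRUE),
  b2 = c(FALSE, TRUE, FALSE)
)

# Category prefixes, computed once from every column except id
cats <- unique(sub("\\d+$", "", setdiff(names(dt), "id")))

for (ct in cats) {
  # Columns of this category: the prefix followed only by digits
  cols <- grep(paste0("^", ct, "\\d+$"), names(dt), value = TRUE)
  new_col <- paste0(ct, "_count")
  # rowSums() treats TRUE as 1, so this counts the TRUEs in each row
  dt[, (new_col) := rowSums(.SD), .SDcols = cols]
}
print(dt)
```

Anchoring the pattern with "\\d+$" keeps the newly added count columns from being picked up on later iterations.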