
Count logical values in subset of columns using .SDcols argument to data.table

I have a data.table of logical values as follows:

library(data.table)
set.seed(1)
myDt <- data.table(id = paste0("id", 1:10))
myDt[, paste0(letters[1:3], sample(1:10, 9, replace = FALSE)) :=
       lapply(1:9, function(i) sample(c(TRUE, FALSE), 10, replace = TRUE))]
myDt
      id    a3    b4    c5    a7    b2    c8    a9    b6   c10
 1:  id1  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE
 2:  id2  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
 3:  id3  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
 4:  id4 FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
 5:  id5  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
 6:  id6 FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
 7:  id7  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE
 8:  id8 FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
 9:  id9 FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
10: id10  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE

The columns apart from id belong to three categories (a, b and c), each with 3 replicates identified by an integer suffix. I need to count the logical values for each category without knowing the replicate numbers in advance.

I can get the columns for category a as follows:

aCols <- grep("^a", names(myDt), value = TRUE)
myDt[, .SD, .SDcols = aCols, by = id]
      id    a3    a7    a9
 1:  id1  TRUE  TRUE FALSE
 2:  id2  TRUE FALSE  TRUE
 3:  id3  TRUE FALSE FALSE
 4:  id4 FALSE FALSE  TRUE
 5:  id5  TRUE FALSE  TRUE
 6:  id6 FALSE FALSE  TRUE
 7:  id7  TRUE FALSE FALSE
 8:  id8 FALSE  TRUE FALSE
 9:  id9 FALSE  TRUE  TRUE
10: id10  TRUE FALSE FALSE

but then I'm stuck when trying to count the logical values. So far I've tried:

myDt[, sum(.SD), .SDcols = aCols, by = id]
Error in gsum(.SD) : 
  GForce sum can only be applied to columns, not .SD or similar. To sum all items in a list such as .SD, either add the prefix base::sum(.SD) or turn off GForce optimization using options(datatable.optimize=1). More likely, you may be looking for 'DT[,lappy(.SD,sum),by=,.SDcols=]'

and

myDt[, base::sum(.SD), .SDcols = aCols, by = id]
Error in FUN(X[[i]], ...) : 
  only defined on a data frame with all numeric variables

I did try the latter code with numerics instead of logicals and it gave me the expected result.

I'd appreciate any suggestions. Thanks for reading!

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
 [4] LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.4

loaded via a namespace (and not attached):
[1] magrittr_1.5   plyr_1.8.3     tools_3.2.2    reshape2_1.4.1 Rcpp_0.12.0    stringi_0.5-5 
[7] stringr_1.0.0  chron_2.3-47  

When you have many columns of the same type and you want to operate on them all at once, it is usually better to tidy up your data (reshape it to long format) and then spread it again. Here's a possible solution using a combination of melt and dcast:

# melt by the "id" column
res <- melt(myDt, id.vars = "id")
# Remove numeric values from column names
res[, indx := sub("\\d+", "", variable)] 
# Spread the data again according to the new index while counting `TRUE`s
dcast(res, id ~ indx, value.var = "value", fun.aggregate = function(x) sum(x == "TRUE"))
#       id a b c
#  1:  id1 2 0 3
#  2: id10 1 1 1
#  3:  id2 2 2 2
#  4:  id3 1 1 2
#  5:  id4 1 2 2
#  6:  id5 2 3 2
#  7:  id6 1 2 0
#  8:  id7 1 3 1
#  9:  id8 1 2 2
# 10:  id9 2 2 2

I've used the development version here (v 1.9.5); if you are using v 1.9.4, you may need to call dcast.data.table instead of just dcast.


Also, you mentioned you have logical values, but your example contained character values (sample(c("TRUE", "FALSE"), 10, replace = TRUE) instead of just sample(c(TRUE, FALSE), 10, replace = TRUE)). If your real data set truly has logical values, then the last step can be simplified to just

dcast(res, id ~ indx, value.var = "value", sum)
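
As a minimal, self-contained sketch of the logical-value case, the snippet below runs the same melt/dcast steps on a small made-up table. The names toyDt, toyLong and toyCounts and the values in them are illustrative assumptions, not the question's data. It relies on sum() treating TRUE as 1 and FALSE as 0, and it uses the dcast spelling available from data.table v 1.9.5 onwards.

```r
library(data.table)

# Hypothetical toy data: two categories (a, b), two replicates each
toyDt <- data.table(
  id = c("id1", "id2", "id3"),
  a3 = c(TRUE, TRUE, FALSE),
  b4 = c(FALSE, TRUE, TRUE),
  a7 = c(TRUE, FALSE, FALSE),
  b2 = c(FALSE, FALSE, TRUE)
)

# Long format: one row per (id, column) pair
toyLong <- melt(toyDt, id.vars = "id")
# Strip the replicate number to leave only the category letter
toyLong[, indx := sub("\\d+$", "", variable)]
# Wide again, summing the logicals (TRUE counts as 1) per id and category
toyCounts <- dcast(toyLong, id ~ indx, value.var = "value", fun.aggregate = sum)
toyCounts
```

Because the values stay logical all the way through, no comparison against the string "TRUE" is needed in the aggregation function.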

I like @David Arenburg's answer. Just to add another option: use rowSums() instead of sum(). With your updated data, use

myDt[, a_cols := rowSums(.SD), .SDcols = aCols]
myDt
          id    a3    b4    c5    a7    b2    c8    a9    b6   c10 a_cols
     1:  id1  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE      2
     2:  id2  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE      2
     3:  id3  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE      1
     4:  id4 FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE      1
     5:  id5  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE      2
     6:  id6 FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE      1
     7:  id7  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE      1
     8:  id8 FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE      1
     9:  id9 FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE      2
    10: id10  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE      1
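
To cover every category without hard-coding the prefixes, one possible extension of the rowSums() idea is to loop over the prefixes found in the column names. This is only a sketch on made-up data: toyDt, valueCols, categories and the "_count" column suffix are assumptions for illustration, and it assumes every non-id column is named as a letter prefix followed by digits.

```r
library(data.table)

# Hypothetical toy data in the same shape as the question's table
toyDt <- data.table(
  id = c("id1", "id2", "id3"),
  a3 = c(TRUE, TRUE, FALSE),
  b4 = c(FALSE, TRUE, TRUE),
  a7 = c(TRUE, FALSE, FALSE),
  b2 = c(FALSE, FALSE, TRUE)
)

# Derive the category prefixes from the column names (here "a" and "b")
valueCols <- setdiff(names(toyDt), "id")
categories <- unique(sub("\\d+$", "", valueCols))

# For each category, add a column counting TRUE values across its replicates
for (p in categories) {
  cols <- grep(paste0("^", p, "\\d+$"), names(toyDt), value = TRUE)
  toyDt[, (paste0(p, "_count")) := rowSums(.SD), .SDcols = cols]
}
toyDt
```

The pattern is anchored with "\\d+$" so that the newly added count columns are not picked up as replicates on later iterations.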
