
Why does data.table group differently depending on whether I pass it the variable name directly or not?

If I pass the variable bloodpressure to data.table directly, everything works fine.

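library(data.table)

tdt <- data.table(bloodpressure = rnorm(1000, mean=100, sd=15), male=rep(c(0,1)))
# strata.var is a plain vector copied out of the table; it is used for grouping below
strata.var <- with(tdt, get(c('male')))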

tdt[,list(
            varname='bloodpressure',
            N=.N,
            mean=mean(bloodpressure, na.rm=TRUE),
            sd=sd(bloodpressure, na.rm=TRUE)
            ),
        by=(strata.var)]

I get this result:

   strata.var       varname   N     mean       sd
1:          0 bloodpressure 500 100.2821 15.13686
2:          1 bloodpressure 500 100.0392 15.02566

which matches the group means:

> mean(tdt$bloodpressure[tdt$male==0])
[1] 100.2821
> mean(tdt$bloodpressure[tdt$male==1])
[1] 100.0392

But if I try to do this programmatically, with the column's values stored in another variable ( var ):

var_as_string <- 'bloodpressure'
var <- with(tdt, get(var_as_string))

tdt[,list(
            varname='bloodpressure',
            N=.N,
            mean=mean(var, na.rm=TRUE),
            sd=sd(bloodpressure, na.rm=TRUE)
            ),
        by=(strata.var)]

I get a different result.

   strata.var       varname   N     mean       sd
1:          0 bloodpressure 500 100.1606 15.13686
2:          1 bloodpressure 500 100.1606 15.02566

Notice that mean is now identical for both groups (i.e. it is calculated across the whole sample, not by group).

> mean(tdt$bloodpressure)
[1] 100.1606

You can replace mean=mean(var, na.rm=TRUE) with mean=mean(get(var_as_string), na.rm=TRUE) and then it should work. Otherwise the expression just uses the numeric vector in var, which was copied out of the table and is not split by group, rather than the data.table column you want it to use, so it returns mean(var) for both subgroups.

library(data.table)
set.seed(1)
tdt <- data.table(bloodpressure = rnorm(1000, mean=100, sd=15), male=rep(c(0,1)))
strata.var <- with(tdt, get(c('male')))

tdt[,list(
        varname='bloodpressure',
        N=.N,
        mean=mean(bloodpressure, na.rm=TRUE),
        sd=sd(bloodpressure, na.rm=TRUE)
        ),
    by=(strata.var)]        
#   strata.var       varname   N      mean       sd
#1:          0 bloodpressure 500  99.58425 15.55735
#2:          1 bloodpressure 500 100.06630 15.50188

var_as_string <- 'bloodpressure'

tdt[,list(
        varname='bloodpressure',
        N=.N,
        mean=mean(get(var_as_string), na.rm=TRUE),
        sd=sd(bloodpressure, na.rm=TRUE)
        ),
    by=(strata.var)]                
#   strata.var       varname   N      mean       sd
#1:          0 bloodpressure 500  99.58425 15.55735
#2:          1 bloodpressure 500 100.06630 15.50188
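
To make the difference visible, here is a minimal sketch that compares vector lengths inside each group. It assumes the same simulated data as above, but groups by the male column directly instead of the external strata.var; the column names n_rows, len_external and len_column are made up for illustration.

```r
library(data.table)

set.seed(1)
tdt <- data.table(bloodpressure = rnorm(1000, mean = 100, sd = 15),
                  male = rep(c(0, 1), times = 500))

var_as_string <- "bloodpressure"
# A plain numeric vector copied out of the table; it knows nothing about groups
var <- with(tdt, get(var_as_string))

lens <- tdt[, list(
    n_rows       = .N,
    len_external = length(var),                # the full external vector, every time
    len_column   = length(get(var_as_string))  # only the rows of the current group
), by = male]
print(lens)
```

If len_external stays at the full row count while len_column matches n_rows, that confirms the external vector is never subset by group, which is why its mean comes out the same in every row.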

OK. With much help from this excellent post, I think I have an answer ...

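colVars <- c('bloodpressure')
byCols <- c('male')

# One summary per column in colVars; the result column keeps the original column name
tdt[, lapply(.SD, function(x) mean(x)), .SDcols = colVars, by=byCols]

# Several summaries at once. Because lapply() returns a list,
# mean and sd come back as list columns here.
tdt[, list(
    mean = lapply(.SD, function(x) mean(x)),
    sd = lapply(.SD, function(x) sd(x))
    ), .SDcols = colVars, by=byCols]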

The trick is to use .SD and .SDcols , and then to wrap everything in lapply .
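
As a possible variation, when only a single column name is stored in the string, .SD[[1]] can be used in place of lapply to get ordinary numeric columns. This is a sketch under that single-column assumption, not part of the original answer; the data are simulated as in the question.

```r
library(data.table)

set.seed(1)
tdt <- data.table(bloodpressure = rnorm(1000, mean = 100, sd = 15),
                  male = rep(c(0, 1), times = 500))

colVars <- c("bloodpressure")  # assumed to hold exactly one column name
byCols  <- c("male")

# .SD holds only the columns in .SDcols, restricted to the current group,
# so .SD[[1]] is the bloodpressure values for that group
res <- tdt[, list(
    varname = colVars,
    N       = .N,
    mean    = mean(.SD[[1]], na.rm = TRUE),
    sd      = sd(.SD[[1]], na.rm = TRUE)
), by = byCols, .SDcols = colVars]
print(res)
```

The per-group means can be checked against tdt[, mean(bloodpressure), by = male], which names the column directly.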

Why is it that, despite searching for ages, it is only after spending another block of time crafting a question that I manage to find the answer? A question for https://meta.stackoverflow.com/ ...

