Why does data.table group differently depending on whether I pass it the variable name directly or not?
If I pass the variable bloodpressure to data.table directly, everything works fine.
library(data.table)
tdt <- data.table(bloodpressure = rnorm(1000, mean=100, sd=15), male=rep(c(0,1)))
strata.var <- with(tdt, get(c('male')))
tdt[,list(
varname='bloodpressure',
N=.N,
mean=mean(bloodpressure, na.rm=TRUE),
sd=sd(bloodpressure, na.rm=TRUE)
),
by=(strata.var)]
I get this result:
strata.var varname N mean sd
1: 0 bloodpressure 500 100.2821 15.13686
2: 1 bloodpressure 500 100.0392 15.02566
This matches the group means:
> mean(tdt$bloodpressure[tdt$male==0])
[1] 100.2821
> mean(tdt$bloodpressure[tdt$male==1])
[1] 100.0392
But if I try to do this programmatically, with the column's values stored in another variable (var):
var_as_string <- 'bloodpressure'
var <- with(tdt, get(var_as_string))
tdt[,list(
varname='bloodpressure',
N=.N,
mean=mean(var, na.rm=TRUE),
sd=sd(bloodpressure, na.rm=TRUE)
),
by=(strata.var)]
I get a different result:
strata.var varname N mean sd
1: 0 bloodpressure 500 100.1606 15.13686
2: 1 bloodpressure 500 100.1606 15.02566
Notice that mean is now identical for both groups (i.e. it is calculated across the whole sample, not by group):
> mean(tdt$bloodpressure)
[1] 100.1606
You can replace mean=mean(var, na.rm=TRUE) with mean=mean(get(var_as_string), na.rm=TRUE), and then it should work. Otherwise the expression just uses the numeric vector stored in var, which holds all 1000 values and is not subset by group, rather than the data.table column you want it to use, so it returns mean(var) for both subgroups.
library(data.table)
set.seed(1)
tdt <- data.table(bloodpressure = rnorm(1000, mean=100, sd=15), male=rep(c(0,1)))
strata.var <- with(tdt, get(c('male')))
tdt[,list(
varname='bloodpressure',
N=.N,
mean=mean(bloodpressure, na.rm=TRUE),
sd=sd(bloodpressure, na.rm=TRUE)
),
by=(strata.var)]
# strata.var varname N mean sd
#1: 0 bloodpressure 500 99.58425 15.55735
#2: 1 bloodpressure 500 100.06630 15.50188
var_as_string <- 'bloodpressure'
tdt[,list(
varname='bloodpressure',
N=.N,
mean=mean(get(var_as_string), na.rm=TRUE),
sd=sd(bloodpressure, na.rm=TRUE)
),
by=(strata.var)]
# strata.var varname N mean sd
#1: 0 bloodpressure 500 99.58425 15.55735
#2: 1 bloodpressure 500 100.06630 15.50188
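To make the difference visible, the following sketch puts both versions side by side in one call. It reuses the names from the question (tdt, var, var_as_string); the output column names (n_in_group, len_var, mean_var, mean_get) are made up for this illustration, and the grouping column is written out with rep(..., times = 500) rather than relying on recycling.

```r
library(data.table)

set.seed(1)
tdt <- data.table(
  bloodpressure = rnorm(1000, mean = 100, sd = 15),
  male = rep(c(0, 1), times = 500)
)

var_as_string <- "bloodpressure"
# A plain numeric vector of length 1000 that lives outside tdt
var <- tdt[[var_as_string]]

res <- tdt[, list(
  n_in_group = .N,                        # rows in the current group
  len_var    = length(var),               # var is not subset per group
  mean_var   = mean(var),                 # so this is the whole-sample mean
  mean_get   = mean(get(var_as_string))   # column looked up inside the group
), by = male]

print(res)
```

In each group, len_var reports the full length of var while n_in_group reports the group size, which is why mean_var repeats the same value and mean_get does not.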
OK. With much help from this excellent post, I think I have an answer ...
colVars <- c('bloodpressure')
byCols <- c('male')
tdt[, lapply(.SD, mean), .SDcols = colVars, by=byCols]
tdt[, list(
mean = lapply(.SD, function(x) mean(x)),
sd = lapply(.SD, function(x) sd(x))
), .SDcols = colVars, by=byCols]
The trick is to use .SD and .SDcols, and then to wrap everything in lapply.
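As a rough sketch of how the same idea could be wrapped for reuse, the function below takes the value column and the grouping columns as strings and returns ordinary numeric columns instead of list columns. The helper name summarise_by and its arguments value_col and by_cols are hypothetical, not part of data.table, and it assumes a single value column.

```r
library(data.table)

set.seed(1)
tdt <- data.table(
  bloodpressure = rnorm(1000, mean = 100, sd = 15),
  male = rep(c(0, 1), times = 500)
)

# Hypothetical helper: summarise one column, named by a string, within groups
summarise_by <- function(dt, value_col, by_cols) {
  dt[, {
    x <- .SD[[1L]]  # the column chosen via .SDcols, restricted to the current group
    list(
      varname = value_col,
      N       = .N,
      mean    = mean(x, na.rm = TRUE),
      sd      = sd(x, na.rm = TRUE)
    )
  }, by = by_cols, .SDcols = value_col]
}

res <- summarise_by(tdt, value_col = "bloodpressure", by_cols = "male")
print(res)
```

Because .SD is rebuilt for every group, nothing here depends on a vector captured outside the data.table, which is what went wrong with var in the question.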
Why is it that, despite searching for ages, it is only after spending another block of time crafting a question that I manage to find the answer? A question for https://meta.stackoverflow.com/ ...