
summaryBy and lots of variables

I want to use summaryBy with three grouping variables (the right-hand side of my formula) and about 170 variables to be summarised (in my case, by calculating the median). How can I specify them all in the same formula?

Instead of typing out

var1+var2+var3...

etc., I thought I could build a string like that. That was a whole project in itself, but at least I now have a stored string containing all the variable names with plus signs in between. I call it z1.
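Building such a string does not have to be a large project. Here is a minimal sketch in base R, assuming the variables to summarise are simply every column that is not a grouping variable. The data frame df and its columns are hypothetical stand-ins for the real data set.

```r
# Hypothetical toy data frame standing in for the real data set.
df <- data.frame(
  year  = c(2020, 2020, 2021),
  month = c(1, 2, 1),
  ID    = c("a", "b", "a"),
  var1  = c(1.0, 2.0, 3.0),
  var2  = c(4.0, 5.0, 6.0),
  var3  = c(7.0, 8.0, 9.0)
)

group_vars <- c("year", "month", "ID")

# Every column that is not a grouping variable is a variable to summarise.
value_vars <- setdiff(names(df), group_vars)

# Join the names with " + " to get the left-hand side as one string.
z1 <- paste(value_vars, collapse = " + ")
print(z1)
#[1] "var1 + var2 + var3"
```

If some column names are not syntactically valid R names (for example, names containing spaces), they would probably need to be wrapped in backticks before being joined.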

Now, simply using z1, or even paste(z1), in my summaryBy call does not work:

d <- summaryBy(paste(z1) ~ year + month + ID,
                data=.., 
                FUN=c(median,sum), 
                na.rm=TRUE)

This gives the error:

Error in tapply(currVAR, rh.string.factor, function(x) { :
  arguments must have same length

I imagine it has to do with the fact that I specify my data in summaryBy. But I am new to R and cannot work out the problem beyond this.
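A likely explanation can be seen in base R alone, without doBy. The sketch below shows that a formula keeps paste(z1) as an unevaluated call, and that evaluating it yields a single string rather than the variables. That this is what triggers the tapply length error inside summaryBy is an assumption, not something checked against the doBy source.

```r
z1 <- "var1 + var2 + var3"

# Typing paste(z1) on the left of ~ does not paste text into the formula.
# The formula stores the unevaluated call paste(z1) as its left-hand side.
f <- paste(z1) ~ year + month + ID
lhs <- f[[2]]
print(class(lhs))
#[1] "call"

# z1 is treated as one variable name, not as var1, var2 and var3.
print(all.vars(f))
#[1] "z1"    "year"  "month" "ID"

# Evaluating the left-hand side gives one character string of length 1,
# which cannot line up with the rows of a data set.
lhs_value <- eval(lhs)
print(length(lhs_value))
#[1] 1
```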

I also tried a different method, as suggested:

d<-summaryBy(paste(z1,"~year+month+ID"),
                data=..,
                FUN=c(median,sum),
                na.rm=TRUE)

This instead gives the error

Error in .get_variables(formula, data, id, debug.info) : 'formula' must be a formula or a list

So I am not sure how to go on from there.
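The second error message is consistent with how base R treats strings and formulas. As a minimal sketch using the same hypothetical variable names: paste() returns a character vector, and only as.formula() turns that text into a formula object.

```r
z1 <- "var1 + var2 + var3"

# paste() returns a character vector, which is not a formula object.
s <- paste(z1, "~ year + month + ID")
print(class(s))
#[1] "character"
print(inherits(s, "formula"))
#[1] FALSE

# as.formula() parses the string into a real formula.
f <- as.formula(s)
print(class(f))
#[1] "formula"
print(all.vars(f))
#[1] "var1"  "var2"  "var3"  "year"  "month" "ID"
```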

From the help documentation:

Computations on several variables is done using cbind( )

summaryBy(cbind(Weight, Feed) ~ Evit + Cu, data=subset(dietox, Time > 1), FUN=fun)

I tested this, this time with z2 being a string of all my variables separated by commas:

d<-summaryBy(cbind(z2)~year+month+ID,
                data=..,
                FUN=c(median,sum),
                na.rm=TRUE)

or the variation

d<-summaryBy(cbind(paste(z2))~year+month+ID,
                data=..,
                FUN=c(median,sum),
                na.rm=TRUE)

Both give the same argument-length error as my original attempt above.
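The cbind attempts probably fail for a similar reason: cbind() applied to one string produces a 1 x 1 character matrix, not a matrix of columns. A short base R sketch, with hypothetical variable names, also shows one way a comma-separated string could be turned into the plus-separated form:

```r
z2 <- "var1, var2, var3"

# cbind() on a single string gives a 1 x 1 character matrix.
m <- cbind(z2)
print(dim(m))
#[1] 1 1
print(is.character(m))
#[1] TRUE

# Split on commas (and any following spaces), then rejoin with " + ".
parts <- strsplit(z2, ",\\s*")[[1]]
z1 <- paste(parts, collapse = " + ")
print(z1)
#[1] "var1 + var2 + var3"
```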

Another suggestion (thanks @akrun):

d<-summaryBy(as.formula(paste(z1,"~year+month+ID")),
                data=..,
                FUN=c(median,sum),
                na.rm=TRUE)

Reminder: z1 is the string of variable names with plus signs in between.

In this case, R gives no error. It seems to be either loading or waiting for additional commands. The console looks like this: [screenshot of console], without the > prompt at the bottom. What does that mean?

Final edit and solution:

The as.formula approach worked! Thanks so much! I now understand that if the console does not show the > prompt at the bottom, as in my screenshot above, it means R is still computing, haha.

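The issue is that paste is wrapped around only the variables of interest, so the left-hand side of the formula is a single character string rather than the variables themselves. Instead, build the whole formula as a string and convert it with as.formula: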

library(doBy)
summaryBy(as.formula(paste(z1, "~ year + month + ID")),
            data=.., 
            FUN=c(median,sum), 
            na.rm=TRUE)

where

z1 <- paste0('var', 1:3, collapse=" + ")

Using a reproducible example from ?summaryBy:

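library(doBy)
data(dietox)
dietox12 <- subset(dietox, Time == 12)

# Summary function returning mean, variance and count
fun <- function(x) {
  c(m = mean(x), v = var(x), n = length(x))
}

# Left-hand side written with cbind()
out1 <- summaryBy(cbind(Weight, Feed) ~ Evit + Cu, data = dietox12, FUN = fun)

# Left-hand side written with +
out2 <- summaryBy(Weight + Feed ~ Evit + Cu, data = dietox12, FUN = fun)

# Left-hand side built as a string and converted with as.formula()
z2 <- paste(c("Weight", "Feed"), collapse = " + ")
out3 <- summaryBy(as.formula(paste(z2, "~ Evit + Cu")), data = dietox12, FUN = fun)

identical(out1, out2)
#[1] TRUE
identical(out1, out3)
#[1] TRUE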

So, thanks to @akrun, the following code now works:

d<-summaryBy(as.formula(paste(z1,"~year+month+ID")),
              data=..,
              FUN=c(median,sum),
              na.rm=TRUE)

The reason I thought it didn't work at first is that it took so long to compute! It is a massive dataset after all. I edited my original post but left all my attempts in, including the question about the "missing arrow", which I now understand to mean that R is working. Hard. Thanks!
