I want to use summaryBy with three grouping variables (the right-hand side of my formula) and about 170 variables to summarise (in my case, by calculating the median). How can I specify them all in the same formula?
Instead of typing out var1+var2+var3... and so on, I thought I could build a string like that. That was a whole project in itself, but I now have a string, which I call z1, that holds all the variable names with plus signs in between.
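For readers who want to skip that project, a string like z1 can usually be built from the column names in a couple of lines. This is only a sketch: the data frame `df` and its columns are hypothetical stand-ins for the real data, and it assumes every column that is not a grouping variable should be summarised.

```r
# Hypothetical toy data standing in for the real data set
df <- data.frame(
  year  = c(2020, 2020, 2021),
  month = c(1, 2, 1),
  ID    = c("a", "b", "a"),
  var1  = c(1.5, 2.5, 3.5),
  var2  = c(10, 20, 30),
  var3  = c(0.1, 0.2, 0.3)
)

group_vars <- c("year", "month", "ID")

# Every column that is not a grouping variable gets summarised
value_vars <- setdiff(names(df), group_vars)

# Join the names with " + " to form the left-hand side of the formula
z1 <- paste(value_vars, collapse = " + ")
z1
```

With 170 columns the same three steps apply; only `group_vars` needs to list the columns to exclude.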
Now, simply asking for z1 or even paste(z1) in my summaryBy call does not work:
d <- summaryBy(paste(z1) ~ year + month + ID,
data=..,
FUN=c(median,sum),
na.rm=TRUE)
Giving error:
Error in tapply(currVAR, rh.string.factor, function(x) { :
arguments must have same length
I imagine it has to do with the fact that in summaryBy I specify my data. But I am new to R and therefore am not able to comprehend the problem beyond this.
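A likely explanation, sketched here in base R without doBy: a formula is stored unevaluated, so a string placed inside it is never expanded into variable names. The value of `z1` below is a made-up three-variable example.

```r
z1 <- "var1 + var2 + var3"

# The formula is stored unevaluated, so paste(z1) stays a literal call
f <- paste(z1) ~ year + month + ID

# The only left-hand "variable" R sees is the object z1 itself
lhs_vars <- all.vars(f[[2]])
lhs_vars

# z1 is a single string, while each grouping column has one value per row,
# which would be consistent with "arguments must have same length"
length(z1)
```

So the function is probably trying to summarise a length-one character vector against grouping columns that are as long as the data.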
I also tried a different method, as suggested:
d<-summaryBy(paste(z1,"~year+month+ID"),
data=..,
FUN=c(median,sum),
na.rm=TRUE)
This instead gives the error
Error in .get_variables(formula, data, id, debug.info) : 'formula' must be a formula or a list
So I am not sure where to go from there.
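That second error can be reproduced in spirit without summaryBy: paste() always returns text, and text is not a formula until it is parsed. A minimal base R sketch, again with a made-up `z1`:

```r
z1 <- "var1 + var2 + var3"

# paste() returns a character vector, not a formula
f_text <- paste(z1, "~ year + month + ID")
class(f_text)

# as.formula() parses that text into a real formula object
f <- as.formula(f_text)
class(f)

# Now the individual variable names are visible to R
all.vars(f)
```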
From the help documentation:
Computations on several variables is done using cbind( )
summaryBy(cbind(Weight, Feed) ~ Evit + Cu, data=subset(dietox, Time > 1), FUN=fun)
I tested this, this time with z2 being a string of all my variables separated by commas:
d<-summaryBy(cbind(z2)~year+month+ID,
data=..,
FUN=c(median,sum),
na.rm=TRUE)
or the variation
d<-summaryBy(cbind(paste(z2))~year+month+ID,
data=..,
FUN=c(median,sum),
na.rm=TRUE)
Both give the same argument-length error as my original attempt above.
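The cbind( ) attempts probably fail for the same reason as the first one: cbind(z2) binds one string, not the columns it names. A base R sketch of what happens, and of how the cbind( ) form could be built as text instead (the value of `z2` is a made-up example):

```r
z2 <- "var1, var2, var3"

# cbind() on a string just makes a 1 x 1 character matrix
dim(cbind(z2))

# Put the cbind() call into the formula text, then parse the whole thing
f <- as.formula(paste0("cbind(", z2, ") ~ year + month + ID"))
f

# The parsed formula now refers to the individual columns
all.vars(f)
```

A formula built this way has the same shape as the cbind(Weight, Feed) ~ Evit + Cu example quoted from the help page.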
Another suggestion (thanks @akrun):
d<-summaryBy(as.formula(paste(z1,"~year+month+ID")),
data=..,
FUN=c(median,sum),
na.rm=TRUE)
Reminder: z1 is the string of variables with pluses in between.
In this case, R gives no error. It seems like it is either loading or waiting for additional commands. The console looks like this (screenshot of console), without the > at the bottom. What does that mean?
The as.formula approach worked! Thanks so much! I now understand that if the console does not have an arrow at the bottom, like in my screenshot above, it means R is computing haha.
The issue is that paste is wrapped around only the variables of interest, so a complete formula is never built. It can be written as
library(doBy)
summaryBy(as.formula(paste(z1, "~ year + month + ID")),
data=..,
FUN=c(median,sum),
na.rm=TRUE)
where
z1 <- paste0('var', 1:3, collapse=" + ")
Using a reproducible example from ?summaryBy
data(dietox)
dietox12 <- subset(dietox,Time==12)
fun <- function(x){
c(m=mean(x), v=var(x), n=length(x))
}
out1 <- summaryBy(cbind(Weight, Feed) ~ Evit + Cu, data=dietox12,
FUN=fun)
out2 <- summaryBy(Weight + Feed ~ Evit + Cu, data=dietox12,
FUN=fun)
z2 <- paste(c("Weight", "Feed"), collapse=" + ")
out3 <- summaryBy(as.formula(paste(z2, "~ Evit + Cu")), data=dietox12,
FUN=fun)
identical(out1, out2)
#[1] TRUE
identical(out1, out3)
#[1] TRUE
So, thanks to @akrun, the following code now works:
d<-summaryBy(as.formula(paste(z1,"~year+month+ID")),
data=..,
FUN=c(median,sum),
na.rm=TRUE)
The reason I thought it didn't at first is because it took so long to compute! It is a massive dataset after all. Edited my original post but left all my tries in there, including the question about the "missing arrow" which I now understand to mean that R is working. Hard. Thanks!