I have a stratified sample with three groups ("a","b","c") that where drawn from a larger population N. All groups have 30 observations but their proportions in N are unequal, hence their sampling weights differ.
I use the survey
package in R to calculate summary statistics and linear regression models and would like to know how to calculate a one-way ANOVA correcting for the survey design (if necessary).
My assumption is and please correct me if I'm wrong, that the standard error for the variance should be normally higher for a population where the weight is smaller, hence a simple ANOVA that does not account for the survey design should not be reliable.
Here is an example. Any help would be appreciated.
## Oneway- ANOVA tests in R for surveys with stratified sampling-design
library("survey")
# create test data
test.df<-data.frame(
id=1:90,
variable=c(rnorm(n = 30,mean=150,sd=10),
rnorm(n = 30,mean=150,sd=10),
rnorm(n = 30,mean=140,sd=10)),
groups=c(rep("a",30),
rep("b",30),
rep("c",30)),
weights=c(rep(1,30), # undersampled
rep(1,30),
rep(100,30))) # oversampled
# correct for survey design
test.df.survey<-svydesign(id=~id,
strata=~groups,
weights=~weights,
data=test.df)
## descriptive statistics
# boxplot
svyboxplot(~variable~groups,test.df.survey)
# means
svyby(~variable,~groups,test.df.survey,svymean)
# variances
svyby(~variable,~groups,test.df.survey,svyvar)
### ANOVA ###
## One-way ANOVA without correcting for survey design
summary(aov(formula = variable~groups,data = test.df))
Hmm this is a interesting question, as far as I know it'd be difficult to consider weights in one-way anova. Thus I decided to show you the way that I'd solve this problem.
I'm going to use two-way anova and then soem port hoc test.
First of all let's build a linear model based on your data and check how does it look like.
library(car)
library(agricolae)
model.lm = lm(variable ~ groups * weights, data = test.df)
shapiro.test(resid(model.lm))
Shapiro-Wilk normality test
data: resid(model.lm)
W = 0.98238, p-value = 0.263
leveneTest(variable ~ groups * factor(weights), data = test.df)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 2.6422 0.07692 .
87
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Distribution is close to normal, variances differ between groups, so the variance isn't homogeneic - should be for parametrical test - anova. However let's perform the test anyway.
Several plots to check that our data fits to this test:
hist(resid(model.lm))
plot(model.lm)
Here is interpretation of plots, they don't look bad actually.
Let's run two-way anova:
anova(model.lm)
Analysis of Variance Table
Response: variable
Df Sum Sq Mean Sq F value Pr(>F)
groups 2 2267.8 1133.88 9.9566 0.0001277 ***
Residuals 87 9907.8 113.88
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As you see, the results are very close to yours. Some post hoc test:
(result.hsd = HSD.test(model.lm, list('groups', 'weights')))
$statistics
MSerror Df Mean CV MSD
113.8831 87 147.8164 7.2195 6.570186
$parameters
test name.t ntr StudentizedRange alpha
Tukey groups:weights 3 3.372163 0.05
$means
variable std r Min Max Q25 Q50 Q75
a:1 150.8601 11.571185 30 113.3240 173.0429 145.2710 151.9689 157.8051
b:1 151.8486 8.330029 30 137.1907 176.9833 147.8404 150.3161 154.7321
c:100 140.7404 11.762979 30 118.0823 163.9753 131.6112 141.1810 147.8231
$comparison
NULL
$groups
variable groups
b:1 151.8486 a
a:1 150.8601 a
c:100 140.7404 b
attr(,"class")
[1] "group"
And maybe some different way:
aov_cont<- aov(test.df$variable ~ test.df$groups * test.df$weights)
summary(aov_cont)
Df Sum Sq Mean Sq F value Pr(>F)
test.df$groups 2 2268 1133.9 9.957 0.000128 ***
Residuals 87 9908 113.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(TukeyHSD(aov_cont))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = test.df$variable ~ test.df$groups * test.df$weights)
$`test.df$groups`
diff lwr upr p adj
b-a 0.9884608 -5.581725 7.558647 0.9315792
c-a -10.1197048 -16.689891 -3.549519 0.0011934
c-b -11.1081657 -17.678352 -4.537980 0.0003461
Summarizing, the results are very close to yours. Personaly I'll run two way anova with (*) symbol or (+) when you are sure that your variables are independent - additive model.
Group c
with bigger weight differs from groups a
and b
substantially.
According to the main statistician of our institute there is no easy implementation of this kind of analysis in any common modeling environment. The reason for that is that ANOVA and ANCOVA are linear models that where not further developed after the emergence of General Linear Models (later Generalized linear models - GLMs ) in the 70's.
A normal linear regression model yields practically the same results as an ANOVA , but is much more flexible regarding variable choice. Since weighting methods exist for GLMs (see survey
package in R) there is no real need to develop methods to weight for stratified sampling design in ANOVA ... simply use a GLM instead.
summary(svyglm(variable~groups,test.df.survey))
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.