
One-way ANOVA for stratified samples in R

I have a stratified sample with three groups ("a", "b", "c") that were drawn from a larger population N. All groups have 30 observations, but their proportions in N are unequal, hence their sampling weights differ.

I use the survey package in R to calculate summary statistics and linear regression models and would like to know how to calculate a one-way ANOVA correcting for the survey design (if necessary).

My assumption (please correct me if I'm wrong) is that the standard error of the variance should normally be higher for a group with a smaller weight, so a simple ANOVA that does not account for the survey design should not be reliable.

Here is an example. Any help would be appreciated.

## One-way ANOVA tests in R for surveys with a stratified sampling design
library("survey")
# create test data
test.df <- data.frame(
  id = 1:90,
  variable = c(rnorm(n = 30, mean = 150, sd = 10),
               rnorm(n = 30, mean = 150, sd = 10),
               rnorm(n = 30, mean = 140, sd = 10)),
  groups = c(rep("a", 30),
             rep("b", 30),
             rep("c", 30)),
  weights = c(rep(1, 30),    # undersampled
              rep(1, 30),
              rep(100, 30))) # oversampled


# correct for survey design
test.df.survey<-svydesign(id=~id,
                           strata=~groups,
                           weights=~weights,
                           data=test.df)

## descriptive statistics
# boxplot
svyboxplot(~variable~groups,test.df.survey)
# means
svyby(~variable,~groups,test.df.survey,svymean)
# variances
svyby(~variable,~groups,test.df.survey,svyvar)


### ANOVA ###
## One-way ANOVA without correcting for survey design
summary(aov(formula = variable~groups,data = test.df))

Hmm, this is an interesting question. As far as I know, it is difficult to account for weights in a one-way ANOVA, so I will show you the way I would approach this problem.

I'm going to use a two-way ANOVA and then some post hoc tests.

First of all, let's build a linear model based on your data and check what it looks like.

library(car)
library(agricolae)
model.lm = lm(variable ~ groups * weights, data = test.df)
shapiro.test(resid(model.lm))

Shapiro-Wilk normality test

data:  resid(model.lm)
W = 0.98238, p-value = 0.263

leveneTest(variable ~ groups * factor(weights), data = test.df)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)  
group  2  2.6422 0.07692 .
      87                  
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The distribution of the residuals is close to normal. Levene's test is borderline (p ≈ 0.077), so the homogeneity of variances that a parametric test like ANOVA assumes is questionable. However, let's perform the test anyway.
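If the borderline homogeneity result is a concern, one possible cross-check (a minimal sketch of my own, not required for the analysis below) is Welch's one-way test from base R, which compares the group means without assuming equal variances; note that it still ignores the sampling weights:

# Welch's one-way test: robust to unequal group variances,
# but does not use the sampling weights
oneway.test(variable ~ groups, data = test.df, var.equal = FALSE)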

Several plots to check that our data fit the assumptions of this test:

hist(resid(model.lm))
plot(model.lm)

Pretty normal. [Histogram of residuals and the four standard lm diagnostic plots omitted here.]

As for interpreting the plots: they don't look bad, actually.

Let's run two-way anova:

anova(model.lm)
Analysis of Variance Table

Response: variable
          Df Sum Sq Mean Sq F value    Pr(>F)    
groups     2 2267.8 1133.88  9.9566 0.0001277 ***
Residuals 87 9907.8  113.88                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As you can see, the results are very close to yours. Now a post hoc test:

(result.hsd = HSD.test(model.lm, list('groups', 'weights')))
$statistics
   MSerror Df     Mean     CV      MSD
  113.8831 87 147.8164 7.2195 6.570186

$parameters
   test         name.t ntr StudentizedRange alpha
  Tukey groups:weights   3         3.372163  0.05

$means
      variable       std  r      Min      Max      Q25      Q50      Q75
a:1   150.8601 11.571185 30 113.3240 173.0429 145.2710 151.9689 157.8051
b:1   151.8486  8.330029 30 137.1907 176.9833 147.8404 150.3161 154.7321
c:100 140.7404 11.762979 30 118.0823 163.9753 131.6112 141.1810 147.8231

$comparison
NULL

$groups
      variable groups
b:1   151.8486      a
a:1   150.8601      a
c:100 140.7404      b

attr(,"class")
[1] "group"

And here is another way to do it:

aov_cont<- aov(test.df$variable ~ test.df$groups * test.df$weights)
summary(aov_cont)
               Df Sum Sq Mean Sq F value   Pr(>F)    
test.df$groups  2   2268  1133.9   9.957 0.000128 ***
Residuals      87   9908   113.9                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(TukeyHSD(aov_cont))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = test.df$variable ~ test.df$groups * test.df$weights)

$`test.df$groups`
           diff        lwr       upr     p adj
b-a   0.9884608  -5.581725  7.558647 0.9315792
c-a -10.1197048 -16.689891 -3.549519 0.0011934
c-b -11.1081657 -17.678352 -4.537980 0.0003461

Summarizing, the results are very close to yours. Personally, I would run a two-way ANOVA with the (*) symbol, or with (+) when you are sure that your variables are independent (an additive model).

Group c, with the larger weight, differs substantially from groups a and b.
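A note on why the weights term never shows up in these tables: in this test data the weight is constant within each group (1 for a and b, 100 for c), so weights is perfectly confounded with groups and contributes no separate degrees of freedom, which is why the "two-way" tables reduce to the one-way result. A quick check of that confounding (a small sketch using the objects defined above):

# weights takes a single value per group, so it is aliased with groups
table(test.df$groups, test.df$weights)
# alias() lists the model terms that are not separately estimable
alias(model.lm)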

According to the main statistician of our institute, there is no easy implementation of this kind of analysis in any common modeling environment. The reason is that ANOVA and ANCOVA are linear models that were not developed further after the emergence of general linear models (later generalized linear models, GLMs) in the 1970s.

A normal linear regression model yields practically the same results as an ANOVA, but is much more flexible regarding variable choice. Since weighting methods exist for GLMs (see the survey package in R), there is no real need to develop methods to weight for a stratified sampling design in ANOVA ... simply use a GLM instead.

summary(svyglm(variable~groups,test.df.survey))
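If you also want a single design-corrected test of the overall group effect (the survey analogue of the one-way ANOVA F-test), the survey package's regTermTest can be applied to the fitted svyglm model. A short sketch along those lines (the object name m is just for illustration):

# design-based Wald test of the whole groups term
m <- svyglm(variable ~ groups, design = test.df.survey)
regTermTest(m, ~groups)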
