
How to run ANOVA on a wide format data.frame?

I've been taught to run an ANOVA with the formula: aov(dependent variable~independent variable, dataset)

but I am struggling to run an ANOVA on one particular dataset, because the measurements are split across three columns. The columns are named Newborn, adolescent and adult (the hamster's age), and the values within each column are blood pressure readings. I need to run a test to determine whether there is a relationship between blood pressure and age.

This is what the data looks like in R:

> hamster
   Newborn adolescent adult
1      108        110   105
2      110        105   100
3       90        100    95
4       80         90    85
5      100        102    97
6      120        110   105
7      125        105   100
8      130        115   110
9      120        100    95
10     130        120   115
11     145        130   125
12     150        125   120
13     130        135   130
14     155        130   125
15     140        120   115

I'm confused because the dependent variable is not in a single column: it is the set of values (^) inside each of the three columns.

The first step is to rearrange your data so it's in a "long" format instead of a "wide" format. This can be done in base R using the reshape function, but it's much easier to use the gather function in the tidyr package:

library(tidyr)
result <- hamster %>%
  gather(age, bp) %>%
  aov(bp ~ age, .)

Loading tidyr also gives us the pipe operator ( %>% ), which lets you chain commands together in a readable way. By default, it takes the result of the previous function and inserts it as the first argument of the next function. In the aov call, we override this with the . placeholder to explicitly pass the data set returned by gather as the 2nd argument.
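To make the role of the pipe and the . placeholder concrete, here is a small sketch that runs the same analysis with and without piping. The data frame is rebuilt from the values in the question; the names long, fit_plain and fit_piped are only illustrative.

```r
library(tidyr)

# Blood pressure values copied from the question (wide format).
hamster <- data.frame(
  Newborn    = c(108, 110, 90, 80, 100, 120, 125, 130, 120, 130, 145, 150, 130, 155, 140),
  adolescent = c(110, 105, 100, 90, 102, 110, 105, 115, 100, 120, 130, 125, 135, 130, 120),
  adult      = c(105, 100, 95, 85, 97, 105, 100, 110, 95, 115, 125, 120, 130, 125, 115)
)

# Without the pipe: reshape first, then pass the long data as the 2nd argument.
long <- gather(hamster, age, bp)
fit_plain <- aov(bp ~ age, long)

# With the pipe: the . stands for whatever came out of the previous step.
fit_piped <- hamster %>%
  gather(age, bp) %>%
  aov(bp ~ age, .)

# Both routes should fit the same model.
all.equal(coef(fit_plain), coef(fit_piped))
```

Writing it out step by step can be easier to debug, since you can inspect long before fitting the model.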

R has a useful function called stack to convert your data format into the one needed for ANOVA.

aov(values ~ ind, stack(hamster))

# Call:
#
# aov(formula = values ~ ind, data = stack(hamster))
#
# Terms:
#                       ind Residuals
# Sum of Squares   1525.378 11429.867
# Deg. of Freedom         2        42
#
# Residual standard error: 16.49666
# Estimated effects may be unbalanced
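Printing the aov object, as above, shows the sums of squares but not the F test. A minimal sketch of getting the usual ANOVA table from the same fit, using only base R (the data frame is rebuilt from the question; long and fit are illustrative names):

```r
# Blood pressure values copied from the question (wide format).
hamster <- data.frame(
  Newborn    = c(108, 110, 90, 80, 100, 120, 125, 130, 120, 130, 145, 150, 130, 155, 140),
  adolescent = c(110, 105, 100, 90, 102, 110, 105, 115, 100, 120, 130, 125, 135, 130, 120),
  adult      = c(105, 100, 95, 85, 97, 105, 100, 110, 95, 115, 125, 120, 130, 125, 115)
)

# stack() returns two columns: values (the numbers) and ind (the source column name).
long <- stack(hamster)

# One-way ANOVA that treats the three age groups as independent samples.
fit <- aov(values ~ ind, data = long)

# summary() adds the mean squares, F value and p-value to the table.
summary(fit)
```

Note that this model ignores the fact that each row of the original data is the same hamster measured three times.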

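The following code runs a repeated measures analysis of variance with one within-subject variable (age) and no between-subjects variables. The hamster id number is kept as its own column so we can use it as the error term in the ANOVA. It survives the reshape because gather() is told to collect only the three age columns, so the group_by() call from the dplyr package is not strictly needed for that.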

hamsterData <- "id   Newborn adolescent adult
1      108        110   105
2      110        105   100
3       90        100    95
4       80         90    85
5      100        102    97
6      120        110   105
7      125        105   100
8      130        115   110
9      120        100    95
10     130        120   115
11     145        130   125
12     150        125   120
13     130        135   130
14     155        130   125
15     140        120   115"

hamster <- read.table(text = hamsterData,header = TRUE )
library(tidyr)
library(dplyr)
result <- hamster %>% group_by(id) %>%
     gather(age,bp, Newborn,adolescent,adult)
result$age <- factor(result$age,levels=c("Newborn","adolescent","adult"))
options(contrasts=c("contr.sum","contr.poly"))
modelAOV <- aov(bp ~ age + Error(factor(id)),data = result)
summary(modelAOV)

...and the output:

> modelAOV <- aov(bp ~ age + Error(factor(id)),data = result)
> summary(modelAOV)

Error: factor(id)
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 14  10013   715.2               

Error: Within
          Df Sum Sq Mean Sq F value  Pr(>F)    
age        2   1525   762.7   15.07 3.6e-05 ***
Residuals 28   1417    50.6                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
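The tidyr documentation now recommends pivot_longer() over gather() for new code. As a sketch of the same repeated measures model with that swap (this is an alternative reshaping step, not what the answer above used; long and fit are illustrative names, and the contrasts option is left out because it is not needed to fit the model):

```r
library(tidyr)

# Same values as in the question, with an explicit id column per hamster.
hamster <- data.frame(
  id         = 1:15,
  Newborn    = c(108, 110, 90, 80, 100, 120, 125, 130, 120, 130, 145, 150, 130, 155, 140),
  adolescent = c(110, 105, 100, 90, 102, 110, 105, 115, 100, 120, 130, 125, 135, 130, 120),
  adult      = c(105, 100, 95, 85, 97, 105, 100, 110, 95, 115, 125, 120, 130, 125, 115)
)

# Wide -> long: one row per hamster per age; id is carried along untouched.
long <- pivot_longer(
  hamster,
  cols      = c(Newborn, adolescent, adult),
  names_to  = "age",
  values_to = "bp"
)

# Fix the order of the age levels and treat id as a factor.
long$age <- factor(long$age, levels = c("Newborn", "adolescent", "adult"))
long$id  <- factor(long$id)

# Repeated measures ANOVA with hamster id as the error stratum.
fit <- aov(bp ~ age + Error(id), data = long)
summary(fit)
```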

