R: One Way Anova and pairwise post hoc tests (Tukey, Scheffe or other) for numerical columns

I have three columns in the dataframe dune (below, at the bottom of the page) describing the % cover of marram grass recorded for three different sand dune ecosystems:

(1) Restored; (2) Degraded; and (3) Natural.

I have performed two different One Way Anova tests (below), test 1 and test 2, to establish significant differences between ecosystems. Test 1 clearly shows significant differences between ecosystems; however, test 2 shows no significant differences. The box plots (below) show stark differences in variance between ecosystems.

Afterwards, I melted the dataframe to produce a factor column (i.e., headed Ecosystem.Type), which is also the response variable. The idea is to apply a glm model (test 3, below) to test with a One Way Anova; however, this method was unsuccessful (please find the error message below).

Problem

I am unsure whether my code for each One Way Anova test is correct, and what the correct procedure is for performing post hoc tests (Tukey HSD, Scheffe or others) to identify which pairs of ecosystems are significantly different. If anyone can help, I would deeply appreciate your advice. Many thanks.

data(dune)

Test 1

dune.type.1<-aov(Natural~Restored+Degraded, data=dune)
summary.aov(dune.type.1, intercept=T)

                Df Sum Sq Mean Sq F value   Pr(>F)    
    (Intercept)  1  34694   34694 138.679 1.34e-09 ***
    Restored     1     94      94   0.375    0.548    
    Degraded     1    486     486   1.942    0.181    
    Residuals   17   4253     250                     
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Post-hoc tests

    posthoc<-TukeyHSD(dune.type.1, conf.level=0.95)

    Error in TukeyHSD.aov(dune.type.1, conf.level = 0.95) : 
      no factors in the fitted model
    In addition: Warning messages:
    1: In replications(paste("~", xx), data = mf) :
       non-factors ignored: Restored
    2: In replications(paste("~", xx), data = mf) :
       non-factors ignored: Degraded

Test 2

     dune1<-aov(Restored~Natural, data=dune)
     dune2<-aov(Restored~Degraded, data=dune)
     dune3<-aov(Degraded~Natural, data=dune)

     summary(dune1)

                 Df Sum Sq Mean Sq F value Pr(>F)
     Natural      1     86   85.58   0.356  0.558
     Residuals   18   4325  240.26               

    summary(dune2)

                 Df Sum Sq Mean Sq F value Pr(>F)
     Degraded     1    160   159.7   0.676  0.422
     Residuals   18   4250   236.1               

     summary(dune3)

                 Df Sum Sq Mean Sq F value Pr(>F)
     Natural      1  168.5  168.49   2.318  0.145
     Residuals   18 1308.5   72.69   

Test 3

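# Assumption: melt() comes from the reshape2 package (the question does not say which package was loaded)
library(reshape2)
melt.dune<-melt(dune, measure.vars=c("Degraded", "Restored", "Natural"))

colnames(melt.dune)<-c("Ecosystem.Type", "Percentage.cover")
melt.dune$Percentage.cover<-as.numeric(melt.dune$Percentage.cover)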

glm.dune<-glm(Ecosystem.Type~Percentage.cover, data=melt.dune)
summary(glm.dune)

Error

glm.dune<-glm(Ecosystem.Type~Percentage.cover, data=melt.dune)
Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  : 
NA/NaN/Inf in 'y'
In addition: Warning messages:
1: In Ops.factor(y, mu) : ‘-’ not meaningful for factors
2: In Ops.factor(eta, offset) : ‘-’ not meaningful for factors
3: In Ops.factor(y, mu) : ‘-’ not meaningful for factors

Melted Dataframe

structure(list(Ecosystem.Type = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Degraded", "Restored", 
"Natural"), class = "factor"), Percentage.cover = c(12, 17, 21, 
11, 22, 16, 7, 9, 14, 2, 3, 15, 23, 4, 19, 36, 26, 4, 15, 23, 
38, 46, 65, 35, 54, 29, 48, 13, 19, 33, 37, 55, 11, 53, 13, 24, 
28, 44, 42, 39, 18, 61, 31, 46, 51, 51, 41, 44, 55, 47, 73, 43, 
25, 42, 21, 13, 65, 30, 47, 29)), row.names = c(NA, -60L), .Names = c("Ecosystem.Type", 
"Percentage.cover"), class = "data.frame")

(Image: box plots of % cover for each ecosystem type)

Data

 structure(list(Degraded = c(12L, 17L, 21L, 11L, 22L, 16L, 7L, 
 9L, 14L, 2L, 3L, 15L, 23L, 4L, 19L, 36L, 26L, 4L, 15L, 23L), 
 Restored = c(38L, 46L, 65L, 35L, 54L, 29L, 48L, 13L, 19L, 
 33L, 37L, 55L, 11L, 53L, 13L, 24L, 28L, 44L, 42L, 39L), Natural = c(18L, 
 61L, 31L, 46L, 51L, 51L, 41L, 44L, 55L, 47L, 73L, 43L, 25L, 
 42L, 21L, 13L, 65L, 30L, 47L, 29L)), .Names = c("Degraded", 
 "Restored", "Natural"), class = "data.frame", row.names = c(NA, 
 -20L))

There are several things I would like to point out.

First, test 1 and test 2 produce similar results. The only difference is that you requested the intercept in test 1 (intercept=T), so the output also tells you whether an intercept is required if you fit a linear model (I will come to that shortly). Hence the significance you see is about whether the line you force to fit needs an intercept or not, not about differences between ecosystems. Try passing intercept=T when summarising the other models and I am pretty sure you will get similar results.
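As a minimal sketch of that suggestion, the block below refits one of the test 2 models from the question and prints its summary with and without the intercept row. The data are copied from the question; the names `no_int` and `with_int` are only illustrative.

```r
# Data copied from the question.
dune <- data.frame(
  Degraded = c(12, 17, 21, 11, 22, 16, 7, 9, 14, 2, 3, 15, 23, 4, 19, 36, 26, 4, 15, 23),
  Restored = c(38, 46, 65, 35, 54, 29, 48, 13, 19, 33, 37, 55, 11, 53, 13, 24, 28, 44, 42, 39),
  Natural  = c(18, 61, 31, 46, 51, 51, 41, 44, 55, 47, 73, 43, 25, 42, 21, 13, 65, 30, 47, 29)
)

# One of the models from test 2 in the question.
dune1 <- aov(Restored ~ Natural, data = dune)

no_int   <- summary(dune1)                    # default table, no intercept row
with_int <- summary(dune1, intercept = TRUE)  # same table plus an (Intercept) row

print(no_int)
print(with_int)
```

The only addition in the second table should be the (Intercept) row, which is not a comparison between ecosystems.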

The second thing to be careful about is the linear model you are trying to fit. The dune.type.1 model actually measures how correlated the different sand dune ecosystems are. In other words, you assume a linear association between Natural and Restored, so that every unit increase in Restored brings some change in Natural. If I understood correctly, what you want is to examine the mean differences, not their correlation. Thus you can do two things:

  1. The data is already in a form suitable for t-tests (t.test compares the means of two groups). This is very easy to do in R and valid here, since all the variables are reasonably normally distributed. However, you will have a multiple testing issue (you will probably perform 3 t-tests to get all the mean differences) and thus need a correction such as Bonferroni.

  2. But I think what you really want is the following:

First, reshape the data

    # factor() is explicit because data.frame() no longer converts strings to factors in R >= 4.0
    data <- data.frame(v = c(dune$Degraded, dune$Restored, dune$Natural),
                       labels = factor(rep(c("Degraded", "Restored", "Natural"), each = 20)))
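As an alternative sketch of the same reshaping step, base R's `stack()` can build the long format without typing the labels by hand. The name `data_stacked` is only illustrative, and the data are copied from the question.

```r
# Data copied from the question.
dune <- data.frame(
  Degraded = c(12, 17, 21, 11, 22, 16, 7, 9, 14, 2, 3, 15, 23, 4, 19, 36, 26, 4, 15, 23),
  Restored = c(38, 46, 65, 35, 54, 29, 48, 13, 19, 33, 37, 55, 11, 53, 13, 24, 28, 44, 42, 39),
  Natural  = c(18, 61, 31, 46, 51, 51, 41, 44, 55, 47, 73, 43, 25, 42, 21, 13, 65, 30, 47, 29)
)

# stack() returns two columns: 'values' (the numbers) and 'ind' (the source column name).
long <- stack(dune)

# Rebuild the factor from character so its levels are alphabetical,
# which keeps "Degraded" as the baseline level.
data_stacked <- data.frame(v = long$values,
                           labels = factor(as.character(long$ind)))

str(data_stacked)
tapply(data_stacked$v, data_stacked$labels, mean)  # group means per ecosystem
```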

Then fit a linear model

    mod.1 <- lm(v ~ labels, data = data)
    summary(mod.1)
    lm(formula = v ~ labels, data = data)

    Residuals:
        Min      1Q  Median      3Q     Max 
    -28.650 -10.725   0.875   8.050  31.350 

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
    (Intercept)      14.950      3.066   4.875 9.07e-06 ***
    labelsNatural    26.700      4.337   6.157 7.95e-08 ***
    labelsRestored   21.350      4.337   4.923 7.64e-06 ***

where you can actually see that the mean of the baseline category (i.e. Degraded) is significantly smaller than the mean of the Natural category, and so on. You can also check the model assumptions to see whether your model is a good fit:

    qqnorm(residuals(mod.1))
    qqline(residuals(mod.1))

(Image: normal Q-Q plot of the residuals)

The residuals are reasonably normal, so the model is fine. You can also follow your anova approach:

    anova.model <- aov(v ~ labels, data = data)
    summary(anova.model)

             Df Sum Sq Mean Sq F value   Pr(>F)    
 labels       2   7982    3991   21.22 1.29e-07 ***
 Residuals   57  10720     188  

which indicates that there is at least one significant difference between the means of the sand dune ecosystems. You can then follow up with Tukey's HSD for the pairwise confidence intervals:

    post <- TukeyHSD(aov(v ~ labels, data = data))
    plot(post, ylim = c(0, 4))
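To read the pairwise differences as numbers rather than from the plot, the object returned by `TukeyHSD()` can be printed or indexed by the factor name. The sketch below rebuilds the data from the question so that it runs on its own; `tukey_table` is an illustrative name.

```r
# Data copied from the question, reshaped to long format.
dune <- data.frame(
  Degraded = c(12, 17, 21, 11, 22, 16, 7, 9, 14, 2, 3, 15, 23, 4, 19, 36, 26, 4, 15, 23),
  Restored = c(38, 46, 65, 35, 54, 29, 48, 13, 19, 33, 37, 55, 11, 53, 13, 24, 28, 44, 42, 39),
  Natural  = c(18, 61, 31, 46, 51, 51, 41, 44, 55, 47, 73, 43, 25, 42, 21, 13, 65, 30, 47, 29)
)
data <- data.frame(v = c(dune$Degraded, dune$Restored, dune$Natural),
                   labels = factor(rep(c("Degraded", "Restored", "Natural"), each = 20)))

post <- TukeyHSD(aov(v ~ labels, data = data))
print(post)

# One matrix per factor; columns are diff, lwr, upr and "p adj".
tukey_table <- post$labels

# Keep only the pairs whose adjusted p-value is below 0.05.
tukey_table[tukey_table[, "p adj"] < 0.05, , drop = FALSE]
```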

(Image: plot of the Tukey HSD confidence intervals)

These intervals are already adjusted for multiple testing :)
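For option 1 above (pairwise t-tests with a Bonferroni correction), a minimal sketch using `pairwise.t.test()` from the stats package might look like this. The data are copied from the question and `pt` is an illustrative name.

```r
# Data copied from the question, reshaped to long format.
dune <- data.frame(
  Degraded = c(12, 17, 21, 11, 22, 16, 7, 9, 14, 2, 3, 15, 23, 4, 19, 36, 26, 4, 15, 23),
  Restored = c(38, 46, 65, 35, 54, 29, 48, 13, 19, 33, 37, 55, 11, 53, 13, 24, 28, 44, 42, 39),
  Natural  = c(18, 61, 31, 46, 51, 51, 41, 44, 55, 47, 73, 43, 25, 42, 21, 13, 65, 30, 47, 29)
)
data <- data.frame(v = c(dune$Degraded, dune$Restored, dune$Natural),
                   labels = factor(rep(c("Degraded", "Restored", "Natural"), each = 20)))

# All three pairwise t-tests in one call, with Bonferroni-adjusted p-values.
# By default the standard deviation is pooled across groups (pool.sd = TRUE).
pt <- pairwise.t.test(data$v, data$labels, p.adjust.method = "bonferroni")
print(pt)

# Lower-triangular matrix of adjusted p-values.
pt$p.value
```

Because the question mentions clear differences in variance between ecosystems, setting `pool.sd = FALSE` is an option worth considering; it runs separate t-tests for each pair instead of pooling the standard deviation.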
