ANOVA with block design and repeated measures

I'm attempting to run some statistical analyses on a field trial that was conducted at 2 sites over the same growing season.

At both sites (Site, levels: HF|NW) the experimental design was a randomised complete block design (RCBD) with 4 blocks (n=4) (Block, levels: 1|2|3|4 within each Site). There were 4 treatments: 3 different forms of nitrogen fertiliser and a control with no nitrogen fertiliser (Treatment, levels: AN, U, IU, C). During the field trial there were 3 distinct periods, each of which commenced with fertiliser addition and ended with harvesting of the grass. These periods have been given the levels 1|2|3 under the factor N_app.

There is a range of measurements on which I would like to test the following null hypothesis:

H0: Treatment had no effect on the measurement.

Two of the measurements I am particularly interested in are grass yield and ammonia emissions.

I'll start with grass yield (Dry_tonnes_ha), a nicely balanced data set.

The data can be downloaded in R using the following code:

library(tidyverse)

download.file('https://www.dropbox.com/s/w5ramntwdgpn0e3/HF_NW_grass_yield_data.csv?raw=1', destfile = "HF_NW_grass_yield_data.csv", method = "auto")
raw_data <- read.csv("HF_NW_grass_yield_data.csv", stringsAsFactors = FALSE)

HF_NW_grass <- raw_data %>% mutate_at(vars(Site, N_app, Block, Plot, Treatment), as.factor) %>% 
  mutate(Date = as.Date(Date, format = "%d/%m/%Y"),
         Treatment = factor(Treatment, levels = c("AN", "U", "IU", "C")))

I have had a go at running an ANOVA on this using the following approach:

model_1 <- aov(formula = Dry_tonnes_ha ~ Treatment * N_app + Site/Block, data = HF_NW_grass, projections = TRUE)

I have a few concerns with this.

Firstly, what is the best way to test the assumptions? For a simple one-way ANOVA I would use shapiro.test() and bartlett.test() on the dependent variable (Dry_tonnes_ha) to assess normality and homogeneity of variance. Can I use the same approach here?

Secondly, I am concerned that N_app is a repeated measure, as the same measurement is taken from the same plot over 3 different periods. What is the best way to build these repeated measures into the model?

Thirdly, I'm not sure of the best way to nest Block within Site. At both sites the levels of Block are 1:4. Do I need to have unique Block levels for each site?

I have another data set for NH3 emissions. R code to download:

download.file('https://www.dropbox.com/s/0ax16x95m2z3fb5/HF_NW_NH3_emissions.csv?raw=1', destfile = "HF_NW_NH3_emissions.csv", method = "auto")
raw_data_1 <- read.csv("HF_NW_NH3_emissions.csv", stringsAsFactors = FALSE)

HF_NW_NH3 <- raw_data_1 %>% mutate_at(vars(Site, N_app, Block, Plot, Treatment), as.factor) %>% 
  mutate(Treatment = factor(Treatment, levels = c("AN", "U", "IU", "C")))

For this I have all the concerns above, with the addition that the data set is unbalanced. At HF, n=3 for N_app 1 but n=4 for N_app 2 and 3. At NW, n=4 for all N_app levels. At HF, measurements were only made on the Treatment levels U and IU. At NW, measurements were made on Treatment levels AN, U and IU.

I'm not sure how to deal with this added level of complexity. I am tempted to just analyse the 2 sites separately (the fact that the N_app periods are not the same at each site may encourage this approach). Can I use a Type III sum of squares ANOVA here?

It has been suggested to me that a linear mixed modelling approach may be the way forward, but I'm not familiar with using these.

I would welcome your thoughts on any of the above. Thanks for your time.

Rory

To answer your first question, on the best way of testing assumptions: while your attempt at using formal statistical tests, as implemented in R, is reasonable, I would actually just visualize the distribution and see whether the data meet the ANOVA assumptions. This approach may seem somewhat subjective, but it does work in most cases.

  • independently, identically distributed (iid) data: this is a question you may already be able to answer, based on how much you know about your data. It's possible to use a chi-square test to determine independence (or not).
  • normally distributed data: use a histogram / QQ plot to check. Based on the distribution, I think it is reasonable to use aov despite the slightly bimodal distribution.

(It appears that a log-transformation helps further meet the normality assumption. This is something you may consider, especially for downstream analyses.)

# Raw response: density and normal QQ plot
par(mfrow = c(2, 2))
plot(density(HF_NW_grass$Dry_tonnes_ha), col = "red", main = "Density")
qqnorm(HF_NW_grass$Dry_tonnes_ha, col = "red", main = "QQ plot")
qqline(HF_NW_grass$Dry_tonnes_ha)

# log10-transformed response: density and normal QQ plot
DTH_trans <- log10(HF_NW_grass$Dry_tonnes_ha)
plot(density(DTH_trans), col = "blue", main = "Transformed density")
qqnorm(DTH_trans, col = "blue", main = "Transformed QQ plot")
qqline(DTH_trans)

Regarding your second question, on the best way to build repeated measures into the model: unfortunately, it is difficult to pinpoint a single "best" model, but based on my experience (mostly with large genomics data sets), you may want to use a linear mixed-effects model. This can be implemented through the lme4 R package, for example. Since it appears you already know how to construct a linear model in R, you should have no problem applying the lme4 functions.
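As a rough illustration of what such a model might look like, here is a minimal sketch on simulated data with the same layout as the yield trial (2 sites, 4 blocks per site, 4 treatments, 3 N_app periods). The simulated values, the helper columns Block_id and Plot_id, and the choice of random intercepts for block and plot are all assumptions made for illustration, not a recommendation for the real data.

```r
library(lme4)

set.seed(1)
# Simulated stand-in for the yield data (NOT the real measurements)
sim <- expand.grid(
  Site      = c("HF", "NW"),
  Block     = factor(1:4),
  Treatment = factor(c("AN", "U", "IU", "C"), levels = c("AN", "U", "IU", "C")),
  N_app     = factor(1:3)
)

# Hypothetical helper columns: one label per physical block and per plot
sim$Block_id <- interaction(sim$Site, sim$Block, drop = TRUE)
sim$Plot_id  <- interaction(sim$Site, sim$Block, sim$Treatment, drop = TRUE)

# Made-up effects: treatment shift, block noise, plot noise, residual noise
trt_eff   <- c(AN = 1.5, U = 1.2, IU = 1.3, C = 0)
block_eff <- rnorm(nlevels(sim$Block_id), sd = 0.5)
plot_eff  <- rnorm(nlevels(sim$Plot_id), sd = 0.3)
sim$Dry_tonnes_ha <- 3 +
  unname(trt_eff[as.character(sim$Treatment)]) +
  block_eff[as.integer(sim$Block_id)] +
  plot_eff[as.integer(sim$Plot_id)] +
  rnorm(nrow(sim), sd = 0.4)

# Fixed effects as in the aov attempt; random intercepts for block and plot.
# The plot intercept ties together the 3 N_app measurements from one plot.
fit_lmm <- lmer(
  Dry_tonnes_ha ~ Treatment * N_app + Site + (1 | Block_id) + (1 | Plot_id),
  data = sim
)
print(summary(fit_lmm))
```

In this sketch the random intercept for Plot_id is what accounts for the repeated measurements on each plot. Whether Site belongs in the fixed or the random part is a judgment call that depends on the question being asked.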

Your third question, on whether to nest two variables, is tricky. If I were you, I would start by treating Site and Block as if they were independent factors. However, if you know they are not independent, you should probably nest them.
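For what it's worth, R's formula syntax makes the nesting explicit: Site/Block expands to Site + Site:Block, so blocks labelled 1:4 at both sites are still treated as different blocks within each site. A small base-R sketch on a made-up design grid (the data frame d and the column Block_id are hypothetical names) shows the expansion and one way to build unique block labels if a model needs them:

```r
# Made-up design grid: blocks are labelled 1:4 at both sites
d <- expand.grid(Site = c("HF", "NW"), Block = factor(1:4))

# Site/Block is shorthand for Site + Site:Block
nest_terms <- attr(terms(~ Site/Block), "term.labels")
print(nest_terms)

# One unique label per physical block (8 in total)
d$Block_id <- interaction(d$Site, d$Block, sep = "_", drop = TRUE)
print(levels(d$Block_id))
```

With unique labels such as Block_id, the block factor can be used on its own without the Site/ prefix.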

I think your questions and concerns are quite open-ended. My recommendation is that as long as you have a plausible justification, go ahead and proceed.

I agree with @David C on the use of visual diagnostics. Simple QQ plots should work:

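library(data.table)
# Assumption: dt is the yield data as a data.table with lower-cased column names
dt <- as.data.table(HF_NW_grass); setnames(dt, tolower(names(dt)))
# dependent variable: raw (left) and log-transformed (right)
par(mfrow = c(1, 2))
qqnorm(dt[, dry_tonnes_ha]); qqline(dt[, dry_tonnes_ha], probs = c(0.15, 0.85))
qqnorm(log(dt[, dry_tonnes_ha])); qqline(log(dt[, dry_tonnes_ha]), probs = c(0.15, 0.85))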


The log transformation looks reasonable to me. You can also see this from the density plot, which is long-tailed and somewhat bimodal:

par(mfrow=c(1,1))
plot(density(dt[,dry_tonnes_ha]))

You could alternatively use lineup plots (Buja et al., 2009) if you wish, although I'm not sure they're needed in this case. A vignette is provided.

library(nullabor)
library(ggplot2)  # provides qplot() and facet_wrap()
# this may not be the best X variable. I'm not familiar with your data
dt_l <- lineup(null_permute("dry_tonnes_ha"), dt)
qplot(dry_tonnes_ha, treatment, data = dt_l) + facet_wrap(~ .sample)


For the other assumptions, you can just use the standard diagnostic plots from lm:

lm2 <- lm(log(dry_tonnes_ha) ~ treatment * n_app + site/block, data = dt)
plot(lm2)

I don't see anything too troublesome in these plots.
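If a formal test is still wanted alongside the plots, one option is to apply it to the model residuals rather than to the raw response, since the ANOVA assumptions concern the errors. The sketch below uses simulated data (all values and the names sim, fit and res are made up for illustration) and base R only:

```r
set.seed(42)
# Simulated stand-in with the same layout as the yield data (NOT real values)
sim <- expand.grid(
  site      = c("HF", "NW"),
  block     = factor(1:4),
  treatment = factor(c("AN", "U", "IU", "C")),
  n_app     = factor(1:3)
)
sim$dry_tonnes_ha <- exp(rnorm(nrow(sim), mean = 1, sd = 0.3))

# Same model structure as the lm fitted above, on the log scale
fit <- lm(log(dry_tonnes_ha) ~ treatment * n_app + site/block, data = sim)
res <- residuals(fit)

# Shapiro-Wilk on the residuals (normality)
sw <- shapiro.test(res)
# Bartlett on the residuals grouped by treatment (equal variances)
bt <- bartlett.test(res, sim$treatment)

print(sw$p.value)
print(bt$p.value)
```

With small group sizes these tests have little power, so they are probably best read together with the diagnostic plots rather than instead of them.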
