ANOVA with block design and repeated measures

Question

I'm attempting to run some statistical analyses on a field trial that was conducted at 2 sites over the same growing season.

At both sites (Site, levels: HF|NW) the experimental design was an RCBD with 4 (n=4) blocks (Block, levels: 1|2|3|4 within each Site). There were 4 treatments: 3 different forms of nitrogen fertiliser and a control (no nitrogen fertiliser) (Treatment, levels: AN, U, IU, C). During the field trial there were 3 distinct periods that commenced with fertiliser addition and ended with harvesting of the grass. These periods have been given the levels 1|2|3 under the factor N_app.

There is a range of measurements on which I would like to test the following null hypothesis H0:

H0: Treatment had no effect on the measurement

Two of the measurements I am particularly interested in are grass yield and ammonia emissions.

Starting with grass yield (Dry_tonnes_ha), as shown here, a nice balanced data set.

The data can be downloaded in R using the following code:

library(tidyverse)

download.file('https://www.dropbox.com/s/w5ramntwdgpn0e3/HF_NW_grass_yield_data.csv?raw=1', destfile = "HF_NW_grass_yield_data.csv", method = "auto")
raw_data <- read.csv("HF_NW_grass_yield_data.csv", stringsAsFactors = FALSE)

HF_NW_grass <- raw_data %>% mutate_at(vars(Site, N_app, Block, Plot, Treatment), as.factor) %>% 
  mutate(Date = as.Date(Date, format = "%d/%m/%Y"),
         Treatment = factor(Treatment, levels = c("AN", "U", "IU", "C")))

I have had a go at running an ANOVA on this using the following approach:

model_1 <- aov(formula = Dry_tonnes_ha ~ Treatment * N_app + Site/Block, data = HF_NW_grass, projections = TRUE)

I have a few concerns with this.

Firstly, what is the best way to test assumptions? For a simple one-way ANOVA I would use shapiro.test() and bartlett.test() on the dependent variable (Dry_tonnes_ha) to assess normality and homogeneity of variance. Can I use the same approach here?

Secondly, I am concerned that N_app is a repeated measure, as the same measurement is taken from the same plot over 3 different periods. What is the best way to build these repeated measures into the model?

Thirdly, I'm not sure of the best way to nest Block within Site. At both sites the levels of Block are 1:4. Do I need to have unique Block levels for each site?

I have another data set for NH3 emissions here. R code to download:

download.file('https://www.dropbox.com/s/0ax16x95m2z3fb5/HF_NW_NH3_emissions.csv?raw=1', destfile = "HF_NW_NH3_emissions.csv", method = "auto")
raw_data_1 <- read.csv("HF_NW_NH3_emissions.csv", stringsAsFactors = FALSE)

HF_NW_NH3 <- raw_data_1 %>% mutate_at(vars(Site, N_app, Block, Plot, Treatment), as.factor) %>% 
  mutate(Treatment = factor(Treatment, levels = c("AN", "U", "IU", "C")))

For this I have all the concerns above, with the addition that the data set is unbalanced. At HF, n=3 for N_app 1 but n=4 for N_app 2 and 3. At NW, n=4 for all N_app levels. At HF, measurements were only made on the Treatment levels U and IU. At NW, measurements were made on Treatment levels AN, U and IU.

I'm not sure how to deal with this added level of complexity. I am tempted to just analyse the data as 2 separate sites (the fact that the N_app periods are not the same at each site may encourage this approach). Can I use a type III sum of squares ANOVA here?

It has been suggested to me that a linear mixed modelling approach may be the way forward, but I'm not familiar with using these.

I would welcome your thoughts on any of the above. Thanks for your time.

Rory

Answer 1

To answer your first question on the best way of testing assumptions: while your attempt at using another statistical test, implemented in R, is reasonable, I would actually just visualize the distribution and see if the data meet ANOVA assumptions. This approach may seem somewhat subjective, but it does work in most cases.

independently, identically distributed (iid) data: this is a question to which you may already have an answer, based on how much you know about your data. It's possible to use a chi-square test to determine independence (or not).
normally distributed data: use a histogram / QQ plot to check. Based on the distribution, I think it is reasonable to use aov despite the slightly bimodal distribution.

(It appears that log-transformation helps further meet the normality assumption. This is something you may consider, especially for downstream analyses.)

par(mfrow=c(2,2))
plot(density(HF_NW_grass$Dry_tonnes_ha), col="red", main="Density")
qqnorm(HF_NW_grass$Dry_tonnes_ha, col="red", main="qqplot")
qqline(HF_NW_grass$Dry_tonnes_ha)

DTH_trans <- log10(HF_NW_grass$Dry_tonnes_ha)
plot(density(DTH_trans), col="blue", main="transformed density")
qqnorm(DTH_trans, col="blue", main="transformed qqplot")
qqline(DTH_trans)

Regarding your second question on the best way to build repeated measures into the model: unfortunately, it is difficult to pinpoint such a "best" model, but based on my knowledge (mostly through genomics big data), you may want to use a linear mixed-effects model. This can be implemented through the lme4 R package, for example. Since it appears you already know how to construct a linear model in R, you should have no problem applying lme4 functions.
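As a rough illustration of what such a model might look like, here is a minimal sketch using lme4::lmer on simulated data that mimics the balanced grass-yield design (2 sites x 4 blocks x 4 treatments x 3 periods). The simulated values, the column names SiteBlock and PlotID, and the choice of random intercepts for block and plot are all assumptions made for illustration, not a recommendation derived from the real data.

```r
library(lme4)

set.seed(1)
# Simulated stand-in for HF_NW_grass (values are made up)
sim <- expand.grid(
  Site      = c("HF", "NW"),
  Block     = factor(1:4),
  Treatment = factor(c("AN", "U", "IU", "C"), levels = c("AN", "U", "IU", "C")),
  N_app     = factor(1:3)
)

# Hypothetical identifiers: blocks unique across sites, plots unique across blocks
sim$SiteBlock <- interaction(sim$Site, sim$Block, drop = TRUE)
sim$PlotID    <- interaction(sim$Site, sim$Block, sim$Treatment, drop = TRUE)

# Response = baseline + control penalty + block effect + plot effect + noise
block_eff <- rnorm(nlevels(sim$SiteBlock), sd = 0.3)
plot_eff  <- rnorm(nlevels(sim$PlotID), sd = 0.2)
sim$Dry_tonnes_ha <- 3 +
  ifelse(sim$Treatment == "C", -1, 0) +
  block_eff[as.integer(sim$SiteBlock)] +
  plot_eff[as.integer(sim$PlotID)] +
  rnorm(nrow(sim), sd = 0.3)

# Random intercept per block, plus a random intercept per plot to account
# for the same plot being measured in each of the 3 N_app periods
fit <- lmer(
  Dry_tonnes_ha ~ Treatment * N_app + Site + (1 | SiteBlock) + (1 | PlotID),
  data = sim
)
print(summary(fit))
```

The plot-level random intercept is one simple way to acknowledge repeated measurements on the same plot; other correlation structures may suit the real data better.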

Your third question regarding whether to nest two variables is tricky. If I were you, I would start with Site and Block as if they were independent factors. However, if you know they are not independent, you should probably nest them.
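On the narrower point of whether Block needs unique levels per site, a small sketch on a toy design grid (not the real data) may help. It shows how the formula term Site/Block expands, and how an explicitly unique block label could be built with interaction(). The column name SiteBlock is hypothetical.

```r
# Toy design grid (not the real data): 2 sites, each with blocks labelled 1 to 4
design <- expand.grid(Site = c("HF", "NW"), Block = factor(1:4))

# In a model formula, Site/Block is shorthand for Site + Site:Block, so
# block 1 at HF and block 1 at NW are treated as different blocks
nested_terms <- attr(terms(~ Site/Block), "term.labels")
print(nested_terms)

# Alternative: build a label that is unique across sites (hypothetical column name)
design$SiteBlock <- interaction(design$Site, design$Block, drop = TRUE)
print(levels(design$SiteBlock))
```

With the nested formula the repeated 1:4 labels are not a problem in themselves; a unique label mainly guards against forgetting the nesting in a later model.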

I think your questions and concerns are quite open-ended. My recommendation is that as long as you have a plausible justification, go ahead and proceed.

Answer 2

I agree with @David C on the use of visual diagnostics. Simple QQ plots should work:

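# Assumption: dt is a data.table copy of HF_NW_grass with lower-case column names.
library(data.table)
dt <- as.data.table(HF_NW_grass)
setnames(dt, tolower(names(dt)))
# QQ plots of the dependent variable: raw (left) and log-transformed (right).
par(mfrow=c(1,2))
qqnorm(dt[,dry_tonnes_ha]); qqline(dt[,dry_tonnes_ha], probs= c(0.15, 0.85))
qqnorm(log(dt[,dry_tonnes_ha])); qqline(log(dt[,dry_tonnes_ha]), probs= c(0.15, 0.85))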

The log transformation looks reasonable to me. You can also see this from the density plot, which is long-tailed and somewhat bimodal:

par(mfrow=c(1,1))
plot(density(dt[,dry_tonnes_ha]))

You could alternatively use lineup plots (Buja et al., 2009) if you wish. I'm not sure they're needed in this case. Vignette provided.

library(nullabor)
library(ggplot2)  # provides qplot() and facet_wrap()
# this may not be the best X variable. I'm not familiar with your data
dt_l <- lineup(null_permute("dry_tonnes_ha"), dt)
qplot(dry_tonnes_ha, treatment, data = dt_l) + facet_wrap(~ .sample)

For the other assumptions, you can just use the standard diagnostic plots from the lm:

lm2 <- lm(log(dry_tonnes_ha) ~ treatment * n_app + site/block, data = dt)
plot(lm2)

I don't see anything too troublesome in these plots.
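If you want to look at the residuals directly rather than through plot(), here is a minimal sketch on simulated data standing in for the real data set (the data frame sim and its values are made up). The normality and constant-variance assumptions of ANOVA apply to the model residuals, so these are the quantities to inspect.

```r
set.seed(42)
# Simulated stand-in for the grass yield data (values are made up)
sim <- expand.grid(
  site      = c("HF", "NW"),
  block     = factor(1:4),
  treatment = c("AN", "U", "IU", "C"),
  n_app     = factor(1:3)
)
sim$dry_tonnes_ha <- exp(rnorm(nrow(sim), mean = 1, sd = 0.3))

# Same model form as above, fitted to the simulated response
fit <- lm(log(dry_tonnes_ha) ~ treatment * n_app + site/block, data = sim)
res <- residuals(fit)

par(mfrow = c(1, 2))
# Normality: residuals should fall close to the reference line
qqnorm(res); qqline(res)
# Constant variance: spread should look similar across fitted values
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```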
