Wrong degrees of freedom in lsmeans and SE calculation in R

I have this sample data:

Sample  Replication Days

    1   1   10
    1   1   14
    1   1   13
    1   1   14
    2   1   NA
    2   1   5
    2   1   18
    2   1   20
    1   2   16
    1   2   NA
    1   2   18
    1   2   21
    2   2   15
    2   2   7
    2   2   12
    2   2   14

I have four observations for each sample, with a total of 64 samples in each of the two replications. In total, I have 512 values across both replications. I also have some missing values designated as 'NA'. I performed an ANOVA on the mean values for each Sample within each Rep, which I generated using:

library(tidyverse)
df <- Data %>% group_by(Sample, Rep) %>% summarise(Mean = mean(Days, na.rm = TRUE)) 
curve.anova <- aov(Mean~Rep+Sample, data=df)

The result of the ANOVA is:

> summary(curve.anova) 
            Df Sum Sq Mean Sq F value Pr(>F)    
Rep          1    6.1   6.071   2.951 0.0915 .  
Sample        63 1760.5  27.945  13.585 <2e-16 ***
Residuals   54  111.1   2.057 

I created a table of the mean and SE values:

library(lsmeans)
ANOVA<-lsmeans(curve.anova, ~Sample)
ANOVA<-summary(ANOVA)
write.csv(ANOVA, file="Desktop/ANOVA.csv")

A few lines from the file are:

Sample  lsmean       SE           df  lower.CL     upper.CL
1       24.875       1.014145417  54  22.84176086  26.90823914
2       25.5         1.014145417  54  23.46676086  27.53323914
3       31.32575758  1.440722628  54  28.43728262  34.21423253
4       26.375       1.014145417  54  24.34176086  28.40823914
5       26.42424242  1.440722628  54  23.53576747  29.31271738
6       25.5         1.014145417  54  23.46676086  27.53323914
7       28.375       1.014145417  54  26.34176086  30.40823914
8       24.875       1.014145417  54  22.84176086  26.90823914
9       21.16666667  1.014145417  54  19.13342752  23.19990581
10      23.875       1.014145417  54  21.84176086  25.90823914

The df for all 64 samples is 54, and the error bars in the ggplot are nearly equal for all the Samples. The SE values are larger than the manually calculated values. According to the ANOVA results, df = 54 is the residual df.

I want to double-check that the ANOVA results are correct and that I am generating the lsmeans and SE correctly, so that I can plot a bar graph in ggplot with confidence-interval error bars.

I will appreciate any help. Thank you!

After reading your comments, I think your workflow has an issue. Basically, when you apply your ANOVA test, you are applying it to the means of the different samples. So, in your example, when you run:

curve.anova <- aov(Mean~Rep+Sample, data=df)

You are comparing these values:

> df
# A tibble: 4 x 3
# Groups:   Sample [2]
  Sample Replication  Mean
   <dbl>       <dbl> <dbl>
1      1           1  12.8
2      1           2  18.3
3      2           1  14.3
4      2           2  12  

So, basically, you are comparing two groups with two values per group.

So, when you tried to remove the Replication group, you got an error because the output of:

df = Data %>% group_by(Sample) %>% summarise(Mean = mean(Days, na.rm = TRUE))

is now:

# A tibble: 2 x 2
  Sample  Mean
   <dbl> <dbl>
1      1  15.1
2      2  13  

So, applying an ANOVA test to that dataset means that you are comparing two groups with one value each. That leaves no residual degrees of freedom, so you can't compute residuals and SE.
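The missing residual degrees of freedom can be checked directly. This is a minimal sketch in base R that retypes the two rounded means from the table above; the name `means_only` is only for illustration and is not part of the original workflow.

```r
# Two groups, one mean each (values retyped from the summary table above)
means_only <- data.frame(
  Sample = factor(c(1, 2)),
  Mean   = c(15.1, 13)
)

# The model uses one df for the intercept and one for Sample,
# so nothing is left over to estimate the error variance.
fit_means <- aov(Mean ~ Sample, data = means_only)
resid_df <- df.residual(fit_means)
print(resid_df)
```

With zero residual df there is no error mean square, which is why lsmeans has nothing to build a standard error from.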

Instead, you should fit the model on the full dataset, without calculating the means first:

anova_data <- aov(Days~Sample+Replication, data=Data)
anova_data2 <- aov(Days~Sample, data=Data)

Their outputs are:

> summary(anova_data)
            Df Sum Sq Mean Sq F value Pr(>F)
Sample       1  16.07  16.071   0.713  0.416
Replication  1   9.05   9.054   0.402  0.539
Residuals   11 247.80  22.528               
2 observations deleted due to missingness

> summary(anova_data2)
            Df Sum Sq Mean Sq F value Pr(>F)
Sample       1  16.07   16.07   0.751  0.403
Residuals   12 256.86   21.41               
2 observations deleted due to missingness
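One caveat when moving from this two-sample example to data with many samples: if Sample is stored as a number, aov fits it as a single slope (1 df) rather than as a grouping variable. The sketch below uses simulated data (the name `toy` and its values are made up for illustration) to show the difference in degrees of freedom between a numeric and a factor Sample.

```r
set.seed(1)

# Hypothetical data: 4 samples with 6 observations each
toy <- data.frame(
  Sample = rep(1:4, each = 6),
  Days   = rnorm(24, mean = 15, sd = 3)
)

# Sample as a number: one slope, 1 df
fit_numeric <- aov(Days ~ Sample, data = toy)
# Sample as a factor: one mean per sample, 4 - 1 = 3 df
fit_factor  <- aov(Days ~ factor(Sample), data = toy)

df_numeric <- summary(fit_numeric)[[1]][["Df"]]
df_factor  <- summary(fit_factor)[[1]][["Df"]]
print(df_numeric)  # 1 22
print(df_factor)   # 3 20
```

With the two-level toy data in this answer both codings give 1 df, so the issue only shows up with more than two samples.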

Now, you can apply lsmeans:

A_d = summary(lsmeans(anova_data, ~Sample))
A_d2 = summary(lsmeans(anova_data2, ~Sample))

> A_d
 Sample lsmean  SE df lower.CL upper.CL
      1   15.3 1.8 11    11.29     19.2
      2   12.9 1.8 11     8.91     16.9

Results are averaged over the levels of: Replication 
Confidence level used: 0.95 

> A_d2
 Sample lsmean   SE df lower.CL upper.CL
      1   15.1 1.75 12    11.33     19.0
      2   13.0 1.75 12     9.19     16.8

Confidence level used: 0.95 

Dropping Replication from the model does not change the means and the SEs much (which is good, because it means that your replicates are consistent and there is not much variability between them), but it slightly narrows the confidence interval.

So, to plot it, you can use:

library(ggplot2)
ggplot(A_d, aes(x=as.factor(Sample), y=lsmean)) + 
  geom_bar(stat="identity", colour="black") +
  geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)
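Since the question asks for confidence-interval error bars, the lower.CL and upper.CL columns of the lsmeans summary can be mapped instead of lsmean ± SE. This is a minimal sketch, assuming ggplot2 is installed; the data frame `ci_table` is a hypothetical stand-in that retypes the rounded A_d values printed above so that the block runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in for A_d, retyped from the rounded output above
ci_table <- data.frame(
  Sample   = factor(c(1, 2)),
  lsmean   = c(15.3, 12.9),
  lower.CL = c(11.29, 8.91),
  upper.CL = c(19.2, 16.9)
)

# Bars show the lsmeans; error bars span the 95% confidence limits
p <- ggplot(ci_table, aes(x = Sample, y = lsmean)) +
  geom_col(colour = "black") +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.5)
print(p)
```

In a real session the same aesthetics should work with A_d itself, since the summary already carries both limit columns.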



Based on your initial question, if you want to check that the output of the ANOVA is correct, you can simulate fake data like this:

d2 <- data.frame(Sample = c(rep(1,10), rep(2,10)),
                 Days = c(rnorm(10, mean =3), rnorm(10, mean = 8)))

Then:

curve.d2 <- aov(Days ~ Sample, data = d2)
ANOVA2 <- lsmeans(curve.d2, ~Sample)
ANOVA2 <- summary(ANOVA2)

And you get the following output:

> summary(curve.d2)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Sample       1 139.32  139.32   167.7 1.47e-10 ***
Residuals   18  14.96    0.83                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> ANOVA2
 Sample lsmean    SE df lower.CL upper.CL
      1   2.62 0.288 18     2.02     3.23
      2   7.90 0.288 18     7.29     8.51

Confidence level used: 0.95 

And for the plot:

ggplot(ANOVA2, aes(x=as.factor(Sample), y=lsmean)) + 
    geom_bar(stat="identity", colour="black") +
    geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)


As you can see, the lsmeans for d2 are close to the 3 and 8 we set in the first place. So, I think your output is correct. The SEs are nearly the same across samples because lsmeans computes them from the model's pooled residual variance, not from each sample's own spread, so samples with the same number of observations get the same SE. It is what it is.
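As a rough check on the table in the question, the SE and confidence limits for Sample 1 can be recomputed by hand from the ANOVA output. The sketch assumes the usual balanced-design formula SE = sqrt(MSE / n), with n = 2 means per sample (one per Rep). The numbers are retyped from the question, and the variable names are only illustrative.

```r
# Values retyped from the ANOVA table and lsmeans output in the question
mse      <- 2.057   # residual mean square
n_reps   <- 2       # means per sample (one per Rep)
df_resid <- 54      # residual degrees of freedom
lsmean_1 <- 24.875  # lsmean reported for Sample 1

# SE from the pooled residual variance (assumed balanced-design formula)
se <- sqrt(mse / n_reps)

# 95% confidence limits use the t quantile on the residual df
half_width <- qt(0.975, df_resid) * se
ci <- c(lower = lsmean_1 - half_width, upper = lsmean_1 + half_width)

print(se)
print(ci)
```

The result agrees with the reported SE of about 1.014 and limits of about 22.84 and 26.91. The larger SE of about 1.44 for Samples 3 and 5 is consistent with those samples having only one Rep mean available.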

I hope this answer helps you.

Data

# Named Data (not df) to match the aov() and group_by() calls above
Data = data.frame(Sample = c(rep(1,4), rep(2,4),rep(1,4), rep(2,4)),
                  Replication = c(rep(1,8), rep(2,8)),
                  Days = c(10,14,13,14,NA,5,18,20,16,NA,18,21,15,7,12,14))
