简体   繁体   中英

Grouping by statistical significance in R

I am trying to figure out the best way to group my data into factors to use for creating an MDS plot. I have annual discharge data that I would like to break into "high", "medium" and "low" flow years (or just high and low if a medium can't be determined). My issue is that I don't know the best way to do this. I can look at plots and pick the high, med, low, but I want a statistically sound way of doing this rather than just eyeballing it.

I have this data

structure(list(Zone = c("B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B"), Year = c(2008L, 2009L, 2010L, 2011L, 2012L, 
2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 2008L, 2009L, 2010L, 
2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 2008L, 
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 
2018L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 
2016L, 2017L, 2018L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 
2014L, 2015L, 2016L, 2017L, 2018L, 2008L, 2009L, 2010L, 2011L, 
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 2008L, 2009L, 
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L
), Month = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L), Discharge = c(10, 77.8, 273.6, 29.3, 
1.5, 86.1, 483.4, 84.8, 53.4, 50.3, 131.3, 3.79, 67.8, 97.1, 
10.7, 144.4, 62.6, 171.6, 29.2, 36.2, 179.6, 383.5, 2.39, 32.4, 
75.2, 7.85, 436.1, 630.1, 57.2, 27.7, 25.8, 73.7, 314.5, 175.7, 
22.8, 68, 4.82, 329.9, 573.8, 34.5, 55.1, 40.1, 150.3, 416.7, 
267.3, 37, 65.3, 5.41, 533.9, 255.2, 69.3, 36.8, 158.8, 95.7, 
271.1, 45.3, 19.9, 74.3, 2.15, 222.2, 59.5, 36, 29.5, 38.7, 56.4, 
65.2, 28.4, 17.2, 19.5, 1.31, 49, 30.8, 29.6, 39.5, 19.5, 35.8, 
123.8), Date = c("2008-05", "2009-05", "2010-05", "2011-05", 
"2012-05", "2013-05", "2014-05", "2015-05", "2016-05", "2017-05", 
"2018-05", "2008-06", "2009-06", "2010-06", "2011-06", "2012-06", 
"2013-06", "2014-06", "2015-06", "2016-06", "2017-06", "2018-06", 
"2008-07", "2009-07", "2010-07", "2011-07", "2012-07", "2013-07", 
"2014-07", "2015-07", "2016-07", "2017-07", "2018-07", "2008-08", 
"2009-08", "2010-08", "2011-08", "2012-08", "2013-08", "2014-08", 
"2015-08", "2016-08", "2017-08", "2018-08", "2008-09", "2009-09", 
"2010-09", "2011-09", "2012-09", "2013-09", "2014-09", "2015-09", 
"2016-09", "2017-09", "2018-09", "2008-10", "2009-10", "2010-10", 
"2011-10", "2012-10", "2013-10", "2014-10", "2015-10", "2016-10", 
"2017-10", "2018-10", "2008-11", "2009-11", "2010-11", "2011-11", 
"2012-11", "2013-11", "2014-11", "2015-11", "2016-11", "2017-11", 
"2018-11")), row.names = c(228L, 229L, 230L, 231L, 232L, 233L, 
234L, 235L, 236L, 237L, 238L, 242L, 243L, 244L, 245L, 246L, 247L, 
248L, 249L, 250L, 251L, 252L, 256L, 257L, 258L, 259L, 260L, 261L, 
262L, 263L, 264L, 265L, 266L, 270L, 271L, 272L, 273L, 274L, 275L, 
276L, 277L, 278L, 279L, 280L, 284L, 285L, 286L, 287L, 288L, 289L, 
290L, 291L, 292L, 293L, 294L, 298L, 299L, 300L, 301L, 302L, 303L, 
304L, 305L, 306L, 307L, 308L, 312L, 313L, 314L, 315L, 316L, 317L, 
318L, 319L, 320L, 321L, 322L), class = "data.frame")

在此处输入图像描述

I ran a pairwise comparison using Wilcoxon rank sum test to determine significant differences between discharge rates for each year and got this:

pairwise.wilcox.test( B_Dis_data$Discharge, B_Dis_data$Year, p.adjust.method = "BH")
    Pairwise comparisons using Wilcoxon rank sum test 

data:  B_Dis_data$Discharge and B_Dis_data$Year 

     2008  2009  2010  2011  2012  2013  2014  2015  2016  2017 
2009 0.868 -     -     -     -     -     -     -     -     -    
2010 0.328 0.191 -     -     -     -     -     -     -     -    
2011 0.283 0.020 0.011 -     -     -     -     -     -     -    
2012 0.235 0.133 0.366 0.069 -     -     -     -     -     -    
2013 0.167 0.110 0.781 0.008 1.000 -     -     -     -     -    
2014 0.366 0.283 0.710 0.008 0.437 0.366 -     -     -     -    
2015 0.710 0.654 0.191 0.014 0.133 0.069 0.328 -     -     -    
2016 0.781 0.654 0.208 0.014 0.167 0.110 0.514 1.000 -     -    
2017 0.328 0.110 0.954 0.008 0.366 0.583 0.710 0.110 0.191 -    
2018 0.069 0.014 0.133 0.008 1.000 0.583 0.191 0.011 0.020 0.110

P value adjustment method: BH 

Any advice would be greatly appreciated!!

Referring to your question about cut :

You could try to ave rage to year mean s and accordingly cut the data into the three categories.

B_Dis_data <- within(B_Dis_data, {
  Discharge.M <- ave(Discharge, Year, FUN=mean)
  Discharge.C.eq <- cut(Discharge.M, breaks=3, c("low", "medium", "high"))
  Discharge.C.qu <- cut(B_Dis_data$Discharge.M, 
                        quantile(Discharge.M, seq(0, 1, length.out=4)), 
                        c("low", "medium", "high"), include.lowest=TRUE)
})

Either cut them into three equal sized intervals using breaks=3 ( Discharge.C.eq ) or use the quantile s 1/3 and 2/3 as break points ( Discharge.C.qu ) (which might be better).

What is happening within cut :

Equal sized intervals:

levels(cut(B_Dis_data$Discharge.M, 3))
# [1] "(8.55,87.6]" "(87.6,166]"  "(166,246]" 
## same as: 
(B_Dis.rg <- range(B_Dis_data$Discharge.M))
# [1]   8.791429 245.285714
c(B_Dis.rg[1], B_Dis.rg[1] + (B_Dis.rg[2] - B_Dis.rg[1])/3*c(1:3))
# [1]   8.791429  87.622857 166.454286 245.285714

Quantiles (terciles):

quantile(B_Dis_data$Discharge.M, seq(0, 1, length.out=4))
#       0%  33.33333%  66.66667%       100% 
# 8.791429  53.214286 125.942857 245.285714 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM