
Test statistics for significant change points from first derivative of GAM smooth?

Questions:

  1. How can I get test statistics for significant change points? I'm also unsure what my degrees of freedom should be, since derivatives() samples the data 200 times.

  2. Am I correctly calculating the simultaneous CI? Gavin Simpson has said that the functions from which I got signifD() only use pointwise intervals. I'm hoping that because I used derivatives() first to compute the simultaneous CI, I'm OK here.

I am modeling electrofishing CPUE data to describe changes in trends of relative abundance over 20 years, as part of my master's thesis. To do this I am using the derivatives() function from the gratia package to calculate the first-order derivative and simultaneous CI from my gamm() model fit. I have read many posts on various sites by Gavin Simpson (and others) about how to do this; however, the required functions and the gratia package seem to have changed a lot over the years, and I want to make sure I am performing all of these calculations correctly. Additionally, my committee members want me to report test statistics for significant change points, and I'm not sure how to get these. Any help would be much appreciated. I have included some data at the bottom.

I'm using gratia_0.7.3.

Set k for gamm() based on the number of years of data (in this case, k = 20).

(k_uor <- length(unique(dat$year)))

Fit the gamm() model of electrofishing catch per unit effort (ECPUE) by year, using the Gamma() distribution with a log link, via REML.

gamm_fit <- gamm((ECPUE + 1) ~ s(year, k = k_uor),
                      family = Gamma(link = "log"), 
                      data = dat,
                      method = "REML")

Create new_data for the years 2001 to 2020.

new_data <- data.frame(year = seq(2001, 2020, 1))

Then use derivatives() from gratia to get the first-order derivative, with a simultaneous CI, from my gamm() model fit for the smooth term s(year) over the study period (new_data).

fd_gamm_fit <- derivatives(gamm_fit,
                  term = "s(year)",
                  newdata = new_data,
                  order = 1L,
                  type = "backward",
                  n = 200,
                  eps = 1e-07,
                  interval = "simultaneous",
                  n_sim = 10000
)

I then get predicted data p2 from my model fit over the study period (new_data).

p2 <- predict(gamm_fit$gam, newdata = new_data, type = "terms", se.fit = TRUE)

Use the signifD() function provided by Gavin Simpson to get the periods where the trend significantly increases or decreases.

signifD <- function(x, d, upper, lower, eval = 0) {
  miss <- upper > eval & lower < eval   # CI spans eval: not significant
  incr <- decr <- x
  want <- d > eval
  incr[!want | miss] <- NA              # keep x only where derivative > eval
  want <- d < eval
  decr[!want | miss] <- NA              # keep x only where derivative < eval
  list(incr = incr, decr = decr)
}

m2.dsig <- signifD(p2$fit, 
                   d = fd_gamm_fit$derivative,
                   fd_gamm_fit$upper, 
                   fd_gamm_fit$lower)

which( !(is.na(m2.dsig$incr)) ) %>%
  fd_gamm_fit[.,]

which( !(is.na(m2.dsig$decr)) ) %>%
  fd_gamm_fit[.,]

ECPUE is significantly increasing during 2010 and 2015 and significantly decreasing during 2003 and 2017.

Plotting the derivative and simultaneous confidence intervals.

draw(fd_gamm_fit,
     alpha = 0.2) +
  theme(axis.text.x = element_text(angle = -45)) +
  scale_x_continuous(breaks = seq(2001, 2020, 1),
                     labels = seq(2001, 2020, 1)) +
  labs(title = "Simultaneous CI first derivative") +
  geom_hline(yintercept = 0)

rm(p2,
   m2.dsig,
   new_data)

[Plot: first derivative with simultaneous CI]

Plot the GAMM fit on the response scale.

Predict data.

new_data_pred <- data.frame(year = seq(2001, 2020, 0.1))

pred1 <- predict(gamm_fit$gam, new_data_pred, type = "response", se.fit = TRUE)

years <- seq(2001,2020,0.1)
upper_2se <- (pred1$fit-1)+(pred1$se.fit*2)
lwr_2se <- (pred1$fit-1)-(pred1$se.fit*2)

pred_2 <- cbind(years, pred1$fit-1, upper_2se, lwr_2se) %>%
  as.data.frame()
names(pred_2) <- c("years", "pred", "upr_2se", "lwr_2se")

rm(new_data_pred,
   years,
   upper_2se,
   lwr_2se,
   pred1)

Create the ggplot.

ggplot(data = pred_2, aes(x = years, y = pred)) +
  labs(title = "") +
  scale_x_continuous(name = "Year",
                     breaks = seq(2001,2020,1),
                     labels = seq(2001,2020,1)) +
  scale_y_continuous(name = "ECPUE") +
  geom_line() + 
  geom_ribbon(aes(ymin = lwr_2se, ymax = upr_2se),
              colour = "grey",
              alpha = 0.2,
              linetype = "blank") +
  geom_line(data = pred_2 %>%
              subset(years >= 2010 &
                       years < 2011),
            aes(x = years, y = pred),
            colour = "blue") +
  geom_line(data = pred_2 %>%
              subset(years >= 2015 &
                       years < 2016),
            aes(x = years, y = pred),
            colour = "blue") +
  geom_line(data = pred_2 %>%
              subset(years >= 2003 &
                       years < 2004),
            aes(x = years, y = pred),
            colour = "red") +
  geom_line(data = pred_2 %>%
              subset(years >= 2017 &
                       years < 2018),
            aes(x = years, y = pred),
            colour = "red") +
  geom_point(data = dat_summary,
             aes(x = year, y = ECPUE))

[Plot: significant change points on the response scale]

Here is some sample data.

dat <- structure(list(year = c(2001, 2001, 2001, 2001, 2001, 2001, 2001, 
2001, 2001, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 
2002, 2002, 2002, 2002, 2002, 2002, 2003, 2003, 2003, 2003, 2003, 
2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2004, 2004, 
2004, 2004, 2004, 2004, 2004, 2005, 2005, 2005, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2006, 
2006, 2006, 2006, 2006, 2006, 2006, 2006, 2007, 2007, 2007, 2007, 
2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 
2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2008, 
2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 
2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 
2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 
2008, 2008, 2008, 2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009, 
2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 
2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 
2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 
2009, 2009, 2009, 2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010, 
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2012, 2012, 
2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 
2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 
2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 
2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 
2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 
2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 
2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 
2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 
2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 
2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 
2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 
2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 
2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 
2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 
2014, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 
2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 
2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 
2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 
2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 
2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 
2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 
2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 
2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2018, 2018, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 
2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 
2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 
2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 
2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020), 
    ECPUE = c(0, 0, 12, 0, 5.99999999999999, 0, 0, 24, 0, 5.99999999999999, 
    0, 12, 0, 5.99999999999999, 5.99999999999999, 12, 0, 0, 0, 
    18, 12, 5.99999999999999, 0, 35.9999999999999, 0, 0, 0, 0, 
    0, 0, 0, 5.99999999999999, 0, 4, 0, 0, 5.99999999999999, 
    5.99999999999999, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 
    0, 0, 5.99999999999999, 0, 0, 0, 0, 0, 0, 5.99999999999999, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5.99999999999999, 
    0, 4, 4, 12, 0, 8, 4, 0, 4, 0, 8, 0, 0, 0, 8, 4, 0, 0, 4, 
    0, 0, 4, 0, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 4, 
    0, 0, 0, 0, 0, 0, 0, 5.99999999999999, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5.99999999999999, 0, 
    0, 0, 0, 0, 4, 12, 0, 0, 0, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 18, 12, 0, 4, 18, 0, 12, 4, 0, 0, 0, 0, 8, 0, 0, 0, 12, 
    4, 0, 0, 4, 0, 18, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 32, 4, 
    4, 48, 32, 36, 12, 16, 12, 0, 48, 0, 8, 36, 0, 0, 32, 4, 
    48, 0, 4, 20, 8, 4, 0, 36, 0, 0, 4, 0, 20, 4, 0, 0, 0, 0, 
    4, 0, 0, 0, 0, 0, 0, 5.99999999999999, 0, 0, 5.99999999999999, 
    5.99999999999999, 0, 0, 0, 0, 5.99999999999999, 5.99999999999999, 
    0, 0, 5.99999999999999, 0, 5.99999999999999, 0, 0, 16, 0, 
    0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 0, 0, 12, 0, 12, 4, 0, 0, 
    48, 0, 0, 4, 0, 5.99999999999999, 0, 0, 0, 4, 12, 0, 12, 
    5.99999999999999, 0, 4, 16, 0, 0, 12, 4, 0, 5.99999999999999, 
    18, 16, 4, 0, 12, 16, 0, 4, 0, 0, 4, 0, 4, 12, 0, 0, 0, 0, 
    4, 0, 4, 0, 0, 0, 0, 0, 12, 0, 4, 0, 4, 0, 5.99999999999999, 
    4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 5.99999999999999, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5.99999999999999, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 0, 0, 12, 8, 
    12, 24, 0, 0, 0, 16, 0, 5.45454545454546, 24, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 32, 24, 0, 0, 4, 0, 0, 0, 5.99999999999999, 
    65.9999999999999, 16, 65.9999999999999, 65.9999999999999, 
    0, 0, 65.9999999999999, 0, 24, 16, 0, 0, 16, 0, 5.99999999999999, 
    0, 0, 24, 0, 0, 0, 0, 4, 0, 0, 0, 8, 4, 0, 0, 8, 4, 8, 4, 
    0, 0, 0, 0, 8, 0, 4, 12, 4, 0, 0, 0, 0, 4, 0, 8, 0, 0, 12, 
    0, 0, 0, 8, 0, 0, 0, 12, 4, 0, 0, 8, 8, 8, 4, 0, 0, 0, 4, 
    8, 0, 16, 0, 8, 0, 0, 4, 0, 4, 0, 4, 0, 8, 0, 0, 0, 0, 0, 
    4, 4, 0, 16, 0, 0, 0, 0, 0, 3.95604395604395, 0, 0, 20, 0, 
    0, 4, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 4, 8, 0, 0, 0, 0, 0, 
    4, 4, 0, 4, 0, 4, 0, 0, 0, 8, 8, 0, 0, 8, 4, 8, 0, 0, 4, 
    8, 4, 0, 3, 4, 4, 4, 5.14285714285715, 0, 10.9090909090909, 
    0, 12, 0, 0, 20, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 3, 20, 0, 
    0, 5.14285714285715, 0)), class = "data.frame", row.names = c(NA, 
-644L))
dat_summary <- structure(list(year = 2001:2020, ECPUE = c(6.66666666666665, 
10.9333333333333, 0.714285714285713, 0, 0, 0.5, 0.639999999999999, 
1.26829268292683, 1.72727272727272, 3.15151515151515, 11.2, 3.71428571428571, 
4.54285714285714, 1.26923076923077, 4.96405919661733, 7.78947368421051, 
2.90909090909091, 3.0465724751439, 3.41463414634146, 3.55351545006718
)), row.names = c(NA, -20L), class = "data.frame")

There are a few things wrong here:

  1. I'd fit this with the tw() family using gam(), not gamm(). You have some 0s, which aren't allowed in the Gamma family, so you are adding 1 to your response, but I don't see any reason why you need to use gamm() here.

  2. Better, assuming you have the effort variable, would be to model these as actual counts with the poisson() or nb() families and include the effort via an offset term: add + offset(log(effort)) to your model formula, where effort is the variable that normalises the counts to catch per unit effort. Absent that variable, tw() is the best option.

  3. You are computing the confidence interval (upper_2se etc.) using predictions on the response scale, and that is incorrect: as your plot shows, the interval contains negative values, which just doesn't make sense. Instead, use type = "link" when predicting, form the confidence interval as you do now, and then back-transform the upper and lower limits and the fitted values to the response scale using the inverse of the link function. In all the cases mentioned (Gamma, tw, nb, poisson) the link function is log(), so you'd want exp() for the inverse. That way you'll get a confidence interval that respects the properties of the data.

  4. I'd compute the derivative over a much finer grid than you do; you are currently estimating the derivative over a very small interval around the integer year values.

    This is actually causing you some problems, as shown in the last plot: each of the blue spikes should be associated with a red decrease, but because of the crude grid of points at which you are evaluating the derivative, you simply aren't seeing the significant decreases. That last red section is also likely spurious.

    If you want an instantaneous measure of the derivative, you should do as you do when you plot the smooth and evaluate it on a finer grid, say 100-200 points over the range min(year) to max(year), e.g. with seq(min(year), max(year), length = 200).

    Or, if you want the derivative over an entire year (?), you would have to increase eps to some much larger value, like 1, and I would suggest you use a central derivative so you aren't extrapolating too far beyond the start/end of the time series.
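Pulling points 1, 3, and 4 together, here is a minimal sketch of what I mean. The model object m and the grid new_years are names I've made up for illustration, and I'm assuming the gratia 0.7.3 derivatives() arguments used in the question, with dat as given at the bottom of the post:

```r
library(mgcv)
library(gratia)

## Point 1: gam() with the Tweedie family handles the zeros directly,
## so there is no need to add 1 to the response
m <- gam(ECPUE ~ s(year, k = 20),
         family = tw(link = "log"),
         data = dat,
         method = "REML")

## Point 3: predict on the link scale, build the interval there,
## then back-transform with the inverse link (exp() for a log link)
new_years <- data.frame(year = seq(2001, 2020, length.out = 200))
p <- predict(m, newdata = new_years, type = "link", se.fit = TRUE)
pred <- transform(new_years,
                  fit = exp(p$fit),
                  lwr = exp(p$fit - 2 * p$se.fit),   # ~95% pointwise interval
                  upr = exp(p$fit + 2 * p$se.fit))

## Point 4: evaluate the derivative on the same fine grid, using a
## central difference and a simultaneous interval
fd <- derivatives(m,
                  term = "s(year)",
                  newdata = new_years,
                  type = "central",
                  interval = "simultaneous",
                  n_sim = 10000)
```

With the finer grid you can pass fd's derivative and simultaneous limits to signifD() exactly as before, and the significant windows should line up with the features of the smooth.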

Remember these are derivatives on the link (log) scale; they aren't response-level changes. Response derivatives are a bit more involved but should be in {gratia} in a few weeks.
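In the meantime, a back-of-the-envelope approximation via the chain rule: with a log link the fitted trend on the response scale is exp(eta(t)), so its derivative is exp(eta(t)) * eta'(t). A sketch, reusing new_data and fd_gamm_fit from the question (run before the rm() call, and assuming derivatives() evaluated at the 20 rows of new_data):

```r
## Chain rule: d/dt exp(eta(t)) = exp(eta(t)) * eta'(t), so multiply the
## link-scale derivative by the fitted values on the response scale.
## Point estimates only; a proper interval needs posterior simulation.
eta <- predict(gamm_fit$gam, newdata = new_data, type = "link")
response_deriv <- as.numeric(exp(eta)) * fd_gamm_fit$derivative
```

Note that because the question's model is fitted to ECPUE + 1, this is the derivative of E[ECPUE + 1], which equals the derivative of E[ECPUE].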

I'll try to work up a full example from your data, but I also have some questions, which I'll post as comments first. Consider this a WIP.
