[英]stat_smooth gam not the same as gam {mgcv}
I was using the stat_smooth function in ggplot2, decided I wanted the "goodness of fit", and used mgvc gam for that. 我在ggplot2中使用stat_smooth函数,决定我想要“合适的好”,并使用mgvc gam。 It occurred to me that I should check to make sure that they were the same model (stat_smooth vs mgvc gam), so I used the code below to check.
在我看来,我应该检查以确保它们是相同的模型(stat_smooth vs mgvc gam),所以我使用下面的代码来检查。 Seemingly, they have different results, as evidenced by the plot ( Plot: stat_smoother gam (red), mgcv gam (black) ).
看起来,他们有不同的结果,如情节所示( 情节:stat_smoother gam(红色),mgcv gam(黑色) )。 However, I don't know why they have different results.
但是,我不知道他们为什么会有不同的结果。 Is it that some default parameter is different between the two?
这两个默认参数有什么不同? Is is that gam is being run on a numeric x and stat_smooth is being run with a POSIXct x (if so - I don't know what to do about that)?
是gam是在数字x上运行而stat_smooth是用POSIXct x运行的(如果是这样 - 我不知道该怎么做)? It looks like stat_smooth is smoother, but the k values are the same...
看起来stat_smooth更平滑,但k值是相同的......
I think there are several posts on how to plot gam outputs in ggplot2, but I'd really like to know why stat_smooth and mgcv are giving different results in the first place. 我想有几个关于如何在ggplot2中绘制gam输出的帖子,但我真的想知道为什么stat_smooth和mgcv首先给出了不同的结果。 I am very new to GAM (and R), so it's quite possible I'm missing something easy.
我是GAM(和R)的新手,所以我很可能会错过一些简单的东西。 However, I did google and search this forum before asking.
但是,我确实谷歌并在询问前搜索此论坛。
My data is a bit big to easily share, so I used a sample dataset - I've put the source in the code, as well as a dput()
below everything, and my sessionInfo()
after that. 我的数据有点大,很容易分享,所以我使用了一个示例数据集 - 我将源代码放在代码中,以及在所有内容下面的
dput()
,然后是我的sessionInfo()
。
I have tried to make a quality question, but it is only my second one. 我试图提出质量问题,但这只是我的第二个问题。 Ever.
永远。 So, constructive criticism is appreciated.
因此,赞赏建设性的批评。
Thank you! 谢谢!
library(readxl)
library(data.table)
library(ggplot2)
library(scales)
library(mgcv)
stackOF_data <- read_excel("mean-daily-flow-cumecs-vatnsdals.xlsx", sheet = "Data")
stackOF_data <- data.table(stackOF_data)
stackOF_data <- stackOF_data[,.(timeseries=as.POSIXct(Date,format("%Y-%m-%d")),mdf)]
a <- stackOF_data[,.(x=as.numeric(timeseries),y=mdf)]
a1 <- gam(y~s(x, k=100, bs="cs"),data=a)
a2=data.table(gam_mdf= predict(a1,a))
a2=cbind(timeseries=stackOF_data$timeseries,a2)
# see if predict and actual are the same
p <- ggplot() +
geom_line(data = a2, aes(x = timeseries, y = gam_mdf), size=1)+
scale_color_manual(values=c("black","magenta"))+
scale_y_continuous()+
scale_x_datetime(labels = date_format("%Y-%m-%d"), breaks = "1 month", minor_breaks = "1 week")+
theme(axis.text.x=element_text(angle=50, size=10,hjust=1))+
stat_smooth(data = stackOF_data, aes(x = (timeseries), y = mdf),method="gam", formula=y~s(x,k=100, bs="cs"), col="red", se=FALSE, size=1)
p
# data from: https://datamarket.com/data/set/235m/mean-daily-flow-cumecs-vatnsdalsa-river-1-jan-1972-31-dec-1974#!ds=235m&display=line&s=14l
> dput(a)
structure(list(x = c(126230400, 126316800, 126403200, 126489600,
126576000, 126662400, 126748800, 126835200, 126921600, 127008000,
127094400, 127180800, 127267200, 127353600, 127440000, 127526400,
127612800, 127699200, 127785600, 127872000, 127958400, 128044800,
128131200, 128217600, 128304000, 128390400, 128476800, 128563200,
128649600, 128736000, 128822400, 128908800, 128995200, 129081600,
129168000, 129254400, 129340800, 129427200, 129513600, 129600000,
129686400, 129772800, 129859200, 129945600, 130032000, 130118400,
130204800, 130291200, 130377600, 130464000, 130550400, 130636800,
130723200, 130809600, 130896000, 130982400, 131068800, 131155200,
131241600, 131328000, 131414400, 131500800, 131587200, 131673600,
131760000, 131846400, 131932800, 132019200, 132105600, 132192000,
132278400, 132364800, 132451200, 132537600, 132624000, 132710400,
132796800, 132883200, 132969600, 133056000, 133142400, 133228800,
133315200, 133401600, 133488000, 133574400, 133660800, 133747200,
133833600, 133920000, 134006400, 134092800, 134179200, 134265600,
134352000, 134438400, 134524800, 134611200, 134697600, 134784000,
134870400, 134956800, 135043200, 135129600, 135216000, 135302400,
135388800, 135475200, 135561600, 135648000, 135734400, 135820800,
135907200, 135993600, 136080000, 136166400, 136252800, 136339200,
136425600, 136512000, 136598400, 136684800, 136771200, 136857600,
136944000, 137030400, 137116800, 137203200, 137289600, 137376000,
137462400, 137548800, 137635200, 137721600, 137808000, 137894400,
137980800, 138067200, 138153600, 138240000, 138326400, 138412800,
138499200, 138585600, 138672000, 138758400, 138844800, 138931200,
139017600, 139104000, 139190400, 139276800, 139363200, 139449600,
139536000, 139622400, 139708800, 139795200, 139881600, 139968000,
140054400, 140140800, 140227200, 140313600, 140400000, 140486400,
140572800, 140659200, 140745600, 140832000, 140918400, 141004800,
141091200, 141177600, 141264000, 141350400, 141436800, 141523200,
141609600, 141696000, 141782400, 141868800, 141955200, 142041600,
142128000, 142214400, 142300800, 142387200, 142473600, 142560000,
142646400, 142732800, 142819200, 142905600, 142992000, 143078400,
143164800, 143251200, 143337600, 143424000, 143510400, 143596800,
143683200, 143769600, 143856000, 143942400, 144028800, 144115200,
144201600, 144288000, 144374400, 144460800, 144547200, 144633600,
144720000, 144806400, 144892800, 144979200, 145065600, 145152000,
145238400, 145324800, 145411200, 145497600, 145584000, 145670400,
145756800, 145843200, 145929600, 146016000, 146102400, 146188800,
146275200, 146361600, 146448000, 146534400, 146620800, 146707200,
146793600, 146880000, 146966400, 147052800, 147139200, 147225600,
147312000, 147398400, 147484800, 147571200, 147657600, 147744000,
147830400, 147916800, 148003200, 148089600, 148176000, 148262400,
148348800, 148435200, 148521600, 148608000, 148694400, 148780800,
148867200, 148953600, 149040000, 149126400, 149212800, 149299200,
149385600, 149472000, 149558400, 149644800, 149731200, 149817600,
149904000, 149990400, 150076800, 150163200, 150249600, 150336000,
150422400, 150508800, 150595200, 150681600, 150768000, 150854400,
150940800, 151027200, 151113600, 151200000, 151286400, 151372800,
151459200, 151545600, 151632000, 151718400, 151804800, 151891200,
151977600, 152064000, 152150400, 152236800, 152323200, 152409600,
152496000, 152582400, 152668800, 152755200, 152841600, 152928000,
153014400, 153100800, 153187200, 153273600, 153360000, 153446400,
153532800, 153619200, 153705600, 153792000, 153878400, 153964800,
154051200, 154137600, 154224000, 154310400, 154396800, 154483200,
154569600, 154656000, 154742400, 154828800, 154915200, 155001600,
155088000, 155174400, 155260800, 155347200, 155433600, 155520000,
155606400, 155692800, 155779200, 155865600, 155952000, 156038400,
156124800, 156211200, 156297600, 156384000, 156470400, 156556800,
156643200, 156729600, 156816000, 156902400, 156988800, 157075200,
157161600, 157248000, 157334400, 157420800, 157507200, 157593600,
157680000), y = c(4.65, 4.65, 4.65, 4.48, 5.16, 5.52, 5.34, 5.34,
4.82, 4.65, 4.48, 4.31, 4.31, 4.31, 4.14, 3.82, 3.98, 3.98, 4.31,
5.71, 6.5, 6.3, 5.71, 5.71, 5.16, 4.65, 4.14, 3.98, 4.48, 4.48,
4.31, 4.65, 4.31, 3.98, 3.98, 3.98, 3.98, 3.98, 3.98, 3.82, 3.67,
3.67, 3.98, 3.98, 3.82, 3.82, 3.82, 4.14, 5.9, 4.48, 3.98, 3.98,
3.82, 3.67, 3.67, 3.67, 4.65, 3.98, 4.31, 4.31, 3.67, 4.31, 6.1,
7.3, 7.5, 7.5, 8.14, 10.8, 16.1, 14.8, 12.5, 9.9, 8.14, 6.9,
6.1, 5.34, 5.16, 4.99, 4.99, 4.99, 4.99, 5.52, 6.3, 7.3, 6.9,
5.9, 5.71, 5.71, 8.58, 31.5, 33.7, 18.4, 11.3, 16.1, 32.9, 45.3,
54, 25.7, 18, 15.9, 15.6, 14.5, 15.9, 35.9, 37.5, 29.4, 27.5,
30.1, 27.5, 30.8, 29.4, 22, 20.1, 35.9, 36.7, 32.9, 22, 18, 15.9,
15.2, 14.8, 13, 12.7, 12.5, 11, 9.68, 8.8, 7.92, 7.3, 6.9, 7.3,
10.3, 11, 11.3, 11.9, 12.5, 13.6, 12.2, 10.8, 9.9, 9.46, 8.8,
7.5, 7.1, 7.71, 7.1, 6.1, 5.34, 5.34, 5.34, 5.52, 5.52, 6.3,
6.7, 6.5, 5.9, 5.71, 5.9, 5.71, 5.52, 7.3, 7.5, 7.1, 7.3, 6.7,
6.9, 7.3, 7.5, 10.8, 11.6, 8.58, 7.92, 7.1, 6.7, 6.5, 6.1, 5.9,
5.9, 5.71, 5.52, 5.52, 5.52, 5.9, 5.9, 5.71, 5.52, 5.52, 5.34,
5.34, 5.52, 6.5, 6.5, 5.71, 5.34, 5.16, 4.99, 4.82, 4.82, 4.99,
4.82, 4.82, 4.82, 4.82, 4.82, 4.65, 4.48, 4.48, 4.31, 4.31, 4.14,
4.14, 4.31, 4.48, 4.31, 4.31, 4.31, 4.99, 5.71, 6.3, 6.1, 6.1,
5.9, 5.71, 5.52, 5.52, 5.52, 5.52, 5.52, 5.34, 5.34, 5.52, 5.52,
5.52, 5.34, 5.34, 5.52, 5.34, 5.52, 5.52, 5.34, 5.34, 5.34, 5.34,
5.71, 5.9, 6.3, 6.9, 7.5, 6.5, 6.1, 6.1, 5.9, 6.1, 6.1, 5.9,
6.5, 6.5, 6.1, 5.9, 5.9, 5.71, 5.9, 5.9, 5.71, 4.99, 4.65, 5.16,
5.34, 5.34, 4.65, 4.99, 5.71, 5.34, 5.34, 5.34, 5.34, 4.99, 5.34,
5.34, 5.34, 5.34, 5.52, 5.34, 5.52, 5.71, 6.5, 7.71, 6.9, 6.5,
6.7, 6.1, 5.9, 6.1, 5.9, 5.71, 7.92, 7.71, 7.1, 7.92, 5.34, 5.16,
8.14, 10.1, 7.92, 7.3, 6.9, 6.9, 6.9, 8.58, 7.3, 6.9, 7.3, 6.3,
5.16, 6.1, 5.52, 4.99, 5.34, 5.34, 5.34, 5.16, 5.71, 5.52, 5.52,
5.16, 4.82, 5.52, 6.1, 5.9, 5.71, 5.52, 5.16, 4.99, 4.48, 4.82,
5.16, 5.16, 5.16, 5.16, 5.16, 4.82, 4.65, 3.82, 4.14, 4.65, 4.65,
4.31, 4.31, 5.16, 5.16, 5.16, 5.16, 5.16, 4.99, 4.65, 5.16, 5.16,
5.16, 5.16, 5.16, 5.16, 5.16, 5.16, 5.34, 5.34)), .Names = c("x",
"y"), row.names = c(NA, -365L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x0000000005860788>)
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 readxl_0.1.0 mgcv_1.8-7 nlme_3.1-121 scales_0.3.0 sos_1.3-8 brew_1.0-6 ggplot2_1.0.1
[9] MASS_7.3-43
loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 lattice_0.20-33 digest_0.6.8 chron_2.3-47 grid_3.2.2 plyr_1.8.3 gtable_0.1.2 magrittr_1.5
[9] stringi_0.5-5 reshape2_1.4.1 Matrix_1.2-2 labeling_0.3 proto_0.3-10 tools_3.2.2 stringr_1.0.0 munsell_0.4.2
[17] colorspace_1.2-6
Partial Solution 部分解决方案
I still don't really know why the two methods are giving different answers, and that bothers me. 我仍然不知道为什么这两种方法给出了不同的答案,这让我感到困扰。 However, after much internet searching, I did find the following workaround:
然而,经过大量的互联网搜索,我确实找到了以下解决方法:
library(readxl)
library(data.table)
library(ggplot2)
library(scales)
library(mgcv)
stackOF_data <- read_excel("C:/Users/jel4049/Desktop/mean-daily-flow-cumecs- vatnsdals.xlsx", sheet = "Data")
stackOF_data <- data.table(stackOF_data)
stackOF_data <- stackOF_data[,.(timeseries=as.POSIXct(Date,format("%Y-%m-%d")),mdf)]
a <- stackOF_data[,.(x=as.numeric(timeseries),y=mdf)]
a1 <- gam(y~s(x, k=100, bs="cs"),data=a)
a2=data.table(gam_mdf = predict(a1,a))
preds <- predict(a1,se.fit=TRUE)
my_data <- data.frame(mu=preds$fit, low =(preds$fit - 1.96 * preds$se.fit), high = (preds$fit + 1.96 * preds$se.fit))
m <- ggplot()+
geom_line(data = a2, aes(x=stackOF_data$timeseries, y=gam_mdf), size=1, col="blue")+
geom_smooth(data=my_data,aes(ymin = low, ymax = high, x=stackOF_data$timeseries, y = mu), stat = "identity", col="green")
m
Now at least I know that the summary and data fit quality info I can get from some of the mgcv functions match my plots. 现在至少我知道我可以从一些mgcv函数得到的摘要和数据拟合质量信息与我的图匹配。
It turns out the differences you were seeing was because you were using the default for the n
argument in stat_smooth
. 事实证明,您所看到的差异是因为您在
stat_smooth
中使用n
参数的默认值。
From the help page: 从帮助页面:
n number of points to evaluate smoother at
n点数要评价更平滑
Of course, it didn't jump out at me right away that this meant n
controls the size of the dataset for the newdata
argument in predict
and therefore stat_smooth
doesn't use the original dataset when making the predictions. 当然,它并没有马上跳出来,这意味着
n
控制predict
newdata
参数的数据集大小,因此stat_smooth
在进行预测时不使用原始数据集。 But I was reading this nice answer on a different stat_smooth
question and realized that to figure out what was going on I should take a closer look at the stat_smooth
predictions vs manual predictions from a fitted gam
model. 但我在一个不同的
stat_smooth
问题上阅读这个很好的答案 ,并意识到要弄清楚发生了什么,我应该仔细看看stat_smooth
预测与来自拟合gam
模型的手动预测。
So, using your dataset from your OP, which I named dat
, we can check what's going on. 因此,使用您命名为
dat
OP中的数据集,我们可以检查发生了什么。
The plot when k = 100
, after fitting the model via gam
and adding the predictions to the dataset. 在通过
gam
拟合模型并将预测添加到数据集之后k = 100
时的图。 As you noted, the blue ( stat_smooth
) and black (manual predictions) don't match. 如您所述,蓝色(
stat_smooth
)和黑色(手动预测)不匹配。
dat$predgam = predict(gam(y ~ s(x, k = 100), data = dat))
(p1 = ggplot(dat, aes(x, y)) +
geom_point() +
geom_smooth(method = "gam", formula = y ~ s(x, k = 100)) +
geom_line(aes(y = predgam)))
You can always use ggplot_build
to look at your plot object and see all the pieces that make it up (I'm not showing the results here because it takes up so much space, but the output will print to your Console). 你总是可以使用
ggplot_build
来查看你的绘图对象并查看构成它的所有部分(我在这里没有显示结果,因为它占用了太多空间,但输出将打印到你的控制台)。
ggplot_build(p1)
The prediction dataset for stat_smooth
is the second in the list of datasets. stat_smooth
的预测数据集是数据集列表中的第二个。
ggplot_build(p1)$data[[2]]
But look how many rows that dataset has: 但是看看数据集有多少行:
nrow(ggplot_build(p1)$data[[2]])
[1] 80
The default setting for the n
argument is 80, but you have 365 rows in your dataset. n
参数的默认设置为80,但数据集中有365行。 So what happens if you change n
to 365? 那么如果你将
n
改为365会发生什么? I'll make the smooth line fatter so you can actually see it (in blue). 我会让平滑的线条更加丰富,所以你可以真正看到它(蓝色)。
(p2 = ggplot(dat, aes(x, y)) +
geom_point() +
geom_smooth(method = "gam", formula = y ~ s(x, k = 100), n = 365, size = 2) +
geom_line(aes(y = predgam)))
nrow(ggplot_build(p2)$data[[2]])
[1] 365
If you look at the code for the predictdf
function mentioned in the Details section of the stat_smooth
help page you'll see that the original dataset isn't used when making predictions. 如果查看
stat_smooth
帮助页面的“详细信息”部分中提到的predictdf
函数的代码,您将看到在进行预测时未使用原始数据集。 Instead, a sequence is made from the original explanatory variable. 相反,序列由原始解释变量构成。 This is something that can be really important when working with a small dataset and you need more prediction points in order for the line to look smooth.
在处理小型数据集时,这是非常重要的,您需要更多预测点才能使线条看起来平滑。 In your case, though, the original dataset is already a nice smooth sequence of
x
so using n = 365
gets the same predictions from stat_smooth
as the original dataset does. 但是,在您的情况下,原始数据集已经是一个很好的平滑
x
序列,因此使用n = 365
可以获得与原始数据集相同的stat_smooth
预测。
You can see the code for predictdf
here . 你可以在这里看到
predictdf
的代码。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.