mgcv gam() error: Model has more coefficients than data

Question

I am using GAMs (generalized additive models) for my dataset. The dataset has 32 observations, with 6 predictor variables and a response variable (namely power). I am using the gam() function of the mgcv package to fit the models. Whenever I try to fit a model, I get the following error message:

Error in gam(formula.hh, data = data, na.action = na.exclude,  : 
  Model has more coefficients than data

From this error message, I infer that I have more predictor variables than observations. I guess this error is generated during cross-validation procedures. Is there any way to handle this error?

I am using the following code:

library(mgcv)
formula.hh <- as.formula(power ~ s(temperature) 
                                + s(prevday1) + s(prevday2)
                                + s(prev_2_hour) + s(prev_instant1))
model <- gam(formula.hh, data = data, na.action = na.exclude)

Here is the data, produced with the dput() function:

> dput(data)
data <- structure(list(power = c(250.615931666667, 252.675878333333, 
1578.209605, 186.636575166667, 1062.07912666667, 1031.481235, 
1584.38902166667, 276.973836666667, 401.620463333333, 1622.50827666667, 
273.825153333333, 1511.37474333333, 291.460865, 215.138178333333, 
247.509348333333, 1140.21383833333, 1680.63441666667, 1742.44168333333, 
592.162706166667, 1610.7307, 615.857495, 1664.13551, 464.973065, 
1956.2482, 1767.94469333333, 1869.02678333333, 1806.731, 1746.3731, 
549.216605, 1425.42390166667, 1900.32575, 1766.18103333333), 
    temperature = c(31, 30, 28, 28, 27, 31, 32, 32, 30.5, 33, 
    33, 30, 32, 24, 30, 26, 28, 32, 34, 25, 32, 33, 35, 36, 36, 
    37, 35, 33, 35, 33, 35, 32), prevday1 = c(NA, 250.615931666667, 
    252.675878333333, 1578.209605, 186.636575166667, 1062.07912666667, 
    1031.481235, 1584.38902166667, 276.973836666667, 401.620463333333, 
    1622.50827666667, 273.825153333333, 1511.37474333333, 291.460865, 
    215.138178333333, 247.509348333333, 1140.21383833333, 1680.63441666667, 
    1742.44168333333, 592.162706166667, 1610.7307, 615.857495, 
    1664.13551, 464.973065, 1956.2482, 1767.94469333333, 1869.02678333333, 
    1806.731, 1746.3731, 549.216605, 1425.42390166667, 1900.32575
    ), prevday2 = c(NA, NA, 250.615931666667, 252.675878333333, 
    1578.209605, 186.636575166667, 1062.07912666667, 1031.481235, 
    1584.38902166667, 276.973836666667, 401.620463333333, 1622.50827666667, 
    273.825153333333, 1511.37474333333, 291.460865, 215.138178333333, 
    247.509348333333, 1140.21383833333, 1680.63441666667, 1742.44168333333, 
    592.162706166667, 1610.7307, 615.857495, 1664.13551, 464.973065, 
    1956.2482, 1767.94469333333, 1869.02678333333, 1806.731, 
    1746.3731, 549.216605, 1425.42390166667), prev_instant1 = c(NA, 
    237.211388333333, 455.932271666667, 367.837349666667, 1230.40137333333, 
    1080.74080166667, 1898.06056666667, 326.103031666667, 302.770571666667, 
    1859.65283333333, 281.700161666667, 1684.32288333333, 291.448878333333, 
    214.838578333333, 254.042623333333, 1380.14074333333, 824.437228333333, 
    1660.46284666667, 268.004111666667, 1715.02763333333, 1853.08503333333, 
    1821.31845, 1173.91945333333, 1859.87353333333, 1887.67635, 
    1760.29563333333, 1876.05421666667, 1743.10665, 366.382048333333, 
    1185.16379, 1713.98534666667, 1746.36006666667), prev_instant2 = c(NA, 
    275.55167, 242.638122833333, 220.635857, 1784.77271666667, 
    1195.45020333333, 590.114391666667, 310.141536666667, 1397.3184605, 
    1747.44398333333, 260.10318, 1521.77355833333, 283.317726666667, 
    206.678135, 231.428693833333, 235.600631666667, 232.455201666667, 
    281.422625, 256.470893333333, 1613.82088333333, 1564.34841666667, 
    1795.03498333333, 1551.64725666667, 1517.69289833333, 1596.66556166667, 
    2767.82433333333, 2949.38005, 328.691775, 389.83789, 1805.71815333333, 
    1153.97645666667, 1752.75968333333), prev_2_hour = c(NA, 
    219.024983, 313.393630708333, 263.748829166667, 931.193606666667, 
    699.399163791667, 754.018962083334, 272.22309625, 595.954508875, 
    1597.21487208333, 512.64361, 1236.42579666667, 281.200373333334, 
    196.983981666666, 230.327737625, 525.483920416666, 391.120302791667, 
    610.101280416667, 247.710625543785, 978.741044166665, 979.658926666667, 
    1189.25306041667, 814.840889166667, 989.059700416665, 1352.2367025, 
    1770.20417833333, 1847.11590666667, 843.191556416666, 363.50806625, 
    904.924465041666, 841.746712500002, 1747.73452958333)), .Names = c("power", 
"temperature", "prevday1", "prevday2", "prev_instant1", "prev_instant2", 
"prev_2_hour"), class = "data.frame", row.names = c(NA, 32L))

Answer 1

This dataset has 32 observations.

Actually, only 30, as two rows contain NA.

From this error message, I infer that I have more predictor variables than observations.

Yes. By default, s() chooses a basis dimension (or rank) of 10 for a 1D smoother, giving 10 raw parameters. After the centering constraint (see ?identifiability) you get one fewer parameter, so you still have 9 parameters for each smooth. Note that you have 5 smooths! So you have 45 parameters for smooth terms, plus an intercept. This is greater than your 30 data points.
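The count above can be written out as a few lines of arithmetic. This is only a sketch of the bookkeeping described in this answer; the variable names are made up for illustration and nothing here calls mgcv.

```r
# Parameter count for the model in the question (illustrative names)
k_default <- 10   # default basis dimension of s() for a 1D smooth
n_smooths <- 5    # smooth terms in the formula
n_complete <- 30  # rows left after dropping the two rows with NA

# Each smooth loses one parameter to the centering constraint; add the intercept
p <- n_smooths * (k_default - 1) + 1

p               # 46
p > n_complete  # TRUE: more coefficients than data
```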

I guess this error is generated during cross-validation procedures.

No. This error is raised as soon as the GAM formula has been interpreted and the model frame has been constructed, before any fitting or smoothing parameter selection takes place. At that stage we already know n (the number of data) and p (the number of parameters).
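One way to see that the error does not depend on the actual values is to reproduce it on made-up data. The sketch below assumes a synthetic data frame sim with 30 complete rows; its values are random and hypothetical, and only the column names follow the question.

```r
library(mgcv)

set.seed(1)
n <- 30
# Synthetic stand-in data: hypothetical values, column names as in the question
sim <- data.frame(
  temperature   = runif(n, 24, 37),
  prevday1      = runif(n, 180, 2000),
  prevday2      = runif(n, 180, 2000),
  prev_2_hour   = runif(n, 190, 1850),
  prev_instant1 = runif(n, 210, 1900)
)
sim$power <- 500 + 30 * sim$temperature + rnorm(n, sd = 100)

# Five smooths with the default k give 46 coefficients for 30 rows,
# so gam() stops before any smoothing parameter is estimated
msg <- tryCatch({
  gam(power ~ s(temperature) + s(prevday1) + s(prevday2)
      + s(prev_2_hour) + s(prev_instant1), data = sim)
  "no error"
}, error = function(e) conditionMessage(e))

print(msg)
```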

Is there any way to handle this error?

Reduce k manually rather than using the default. However, for a cubic spline the minimum k is 3. For example, use s(temperature, bs = 'cr', k = 3). Note that I have also set bs = 'cr' to use a natural cubic spline rather than the default bs = 'tp' (thin-plate regression spline). You can of course keep the default, but for a 1D smooth "cr" is a natural choice.
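Applied to every term of the formula, the suggestion might look like the sketch below. It is fitted to synthetic data (a hypothetical data frame sim with random values and the question's column names), so the fit itself is meaningless; the point is only that the coefficient count drops below the number of rows. Setting k = 3 for all five terms is one possible choice, not a recommendation for the real data.

```r
library(mgcv)

set.seed(1)
n <- 30
# Synthetic stand-in data: hypothetical values, column names as in the question
sim <- data.frame(
  temperature   = runif(n, 24, 37),
  prevday1      = runif(n, 180, 2000),
  prevday2      = runif(n, 180, 2000),
  prev_2_hour   = runif(n, 190, 1850),
  prev_instant1 = runif(n, 210, 1900)
)
sim$power <- 500 + 30 * sim$temperature + rnorm(n, sd = 100)

# Natural cubic regression splines with the smallest allowed basis dimension
fit <- gam(power ~ s(temperature, bs = "cr", k = 3)
           + s(prevday1, bs = "cr", k = 3)
           + s(prevday2, bs = "cr", k = 3)
           + s(prev_2_hour, bs = "cr", k = 3)
           + s(prev_instant1, bs = "cr", k = 3),
           data = sim)

# 5 smooths * (3 - 1) parameters + 1 intercept
length(coef(fit))
```

With the real data, a larger k could be given to the terms expected to be most nonlinear, as long as the total stays below the 30 complete rows.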
