Linear regression [R]: how to calculate multiple coefficients for the same predictor based on the occurrence of a categorical variable
I have a forecasting problem that I am approaching with linear regression. In this problem the day of the week matters. At the moment I use:
lm.mod <- lm(y ~ x + monday + tuesday + thursday + friday + saturday + sunday, data=train)
Here y and x are continuous variables and the days of the week are dummy variables (each is either 0 or 1). This way the weekday dependence enters through different intercepts (the coefficients on the dummies). However, I would like to estimate a different coefficient for x for each day of the week.
I can do this with a GAM (library: mgcv) inside the spline function, where "day" is a categorical variable containing the name of the day of the week:
gam.mod <- mgcv::gam(y ~ s(x, bs='cs', by=day) + monday + tuesday + thursday + friday + saturday + sunday, data = train, method="REML", select=TRUE)
Here are a few lines of the data frame train:
Date | y | x | day | Monday | Tuesday | Wednesday |
---------------------------------------------------------------------------------
2013-01-01 | 0.87604858 | 0.07339450 | Tuesday | 0 | 1 | 0 |
2013-01-02 | 0.90190414 | 0.16513761 | Wednesday | 0 | 0 | 1 |
With mgcv I obtain a different spline for each day of the week (each value of the factor variable "day"). With a linear model I would like to obtain as many coefficients for x as there are levels of the factor variable. Is it possible? Any workaround?
Maybe I'm missing something, but it appears to me you are asking for the interaction between x and the weekdays?
I.e., simplified a bit, something like this:
# Toy data
n <- 100
train <- data.frame(replicate(5, rnorm(n)))
names(train) <- c("x", "y", "mon", "tue", "wed")
lm.mod <- lm(y ~ x*(mon + tue + wed), data=train)
You want to avoid creating the binary terms yourself. In fact, given that the mgcv notation implies a spline by day, you want to include day as a factor in the model, not all those separate terms.
So, the gam model would be:
gam(y ~ s(x, bs='cs', by=day) + day, data = train, method="REML", select=TRUE)
where day is a factor with levels c('Monday','Tuesday', ....).
Then the linear model becomes:
lm(y ~ x * day, data = train)
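As a rough illustration of how the per-day slopes come out of this fit, here is a minimal sketch on simulated data. The data frame sim, the three-day factor and the true slopes are invented for the example and are not from the question. Only lm() and coef() from base R are used.

```r
# Simulated data (hypothetical): three days, each with its own slope for x
set.seed(1)
days <- c("Monday", "Tuesday", "Wednesday")
sim <- data.frame(
  x   = runif(300),
  day = factor(rep(days, each = 100), levels = days)
)
true_slope <- c(Monday = 1, Tuesday = 2, Wednesday = 3)
sim$y <- 0.5 + unname(true_slope[as.character(sim$day)]) * sim$x +
  rnorm(300, sd = 0.1)

fit <- lm(y ~ x * day, data = sim)
cf  <- coef(fit)

# "x" is the slope for the reference level (Monday); each "x:day<level>"
# coefficient is the difference from that reference slope.
slopes <- c(
  Monday    = unname(cf["x"]),
  Tuesday   = unname(cf["x"] + cf["x:dayTuesday"]),
  Wednesday = unname(cf["x"] + cf["x:dayWednesday"])
)
print(round(slopes, 2))
```

With the default treatment contrasts, the first factor level acts as the baseline, so the slope for any other day is the sum of two coefficients.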
You have to work a little harder to get the estimated means for each day. For the gam() model, use predict() with newdata containing one row per day and type = 'terms'; you can then add the intercept to the day contribution (effect). For the lm() model you can most easily do this using the multcomp package.
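If installing multcomp is not an option, a base R alternative for the lm() model is to call predict() on a small newdata frame with one row per day. This is a swapped-in technique, not the multcomp route mentioned above. The simulated data frame sim and the choice of holding x at its sample mean are assumptions made for the sketch.

```r
# Simulated data (hypothetical), standing in for `train`
set.seed(2)
days <- c("Monday", "Tuesday", "Wednesday")
sim <- data.frame(
  x   = runif(300),
  day = factor(rep(days, each = 100), levels = days)
)
sim$y <- 1 + as.integer(sim$day) * sim$x + rnorm(300, sd = 0.1)

fit <- lm(y ~ x * day, data = sim)

# One row per day, with x held at a chosen value (here its sample mean)
nd <- data.frame(x = mean(sim$x), day = factor(days, levels = days))
pr <- predict(fit, newdata = nd, se.fit = TRUE)

# Estimated mean of y for each day at that x, with its standard error
est <- data.frame(day = nd$day, estimate = unname(pr$fit), se = pr$se.fit)
print(est)
```

Setting x to 0 in nd instead would return the per-day intercepts rather than the fitted means at a typical x.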
You could also just drop the intercept (add + 0 to the model formula). There are other ways to parameterise the model that more easily give you the estimates you may want.
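One such parameterisation, sketched here on simulated data, drops both the global intercept and the main effect of x, so that every day gets its own intercept and its own slope as a directly reported coefficient. The data frame sim and the object names fit_sep and fit_int are invented for the example.

```r
# Simulated data (hypothetical)
set.seed(3)
days <- c("Monday", "Tuesday", "Wednesday")
sim <- data.frame(
  x   = runif(300),
  day = factor(rep(days, each = 100), levels = days)
)
sim$y <- 1 + as.integer(sim$day) * sim$x + rnorm(300, sd = 0.1)

# No global intercept and no main effect of x:
# "day<level>" is that day's intercept, "day<level>:x" is that day's slope
fit_sep <- lm(y ~ 0 + day + day:x, data = sim)
print(round(coef(fit_sep), 2))

# The treatment-coded model describes the same fit with different coefficients
fit_int  <- lm(y ~ x * day, data = sim)
same_fit <- all.equal(fitted(fit_sep), fitted(fit_int))
print(same_fit)
```

Both formulas span the same model space, so the choice between them only affects how the coefficients are reported. Note that R-squared from a no-intercept fit is computed differently and should not be compared with the other model's.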
That your models are even fitting is because R is internally dropping some effects. You can't fit an intercept and a full set of separate day terms: the day dummies sum to the intercept column, so one of them is linearly dependent on the intercept and the others and cannot be uniquely identified.
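A small check of this on simulated data: when all seven hand-made dummies are entered together with an intercept, lm() still returns a fit but leaves one coefficient as NA. The data frame sim and its lower-case dummy columns are built here purely for illustration.

```r
# Simulated data (hypothetical): 20 weeks of daily observations
set.seed(4)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
sim <- data.frame(x = runif(140), day = rep(days, times = 20))
sim$y <- 1 + 2 * sim$x + rnorm(140, sd = 0.1)

# Hand-made 0/1 dummy columns, one per day (all seven)
for (d in days) sim[[tolower(d)]] <- as.integer(sim$day == d)

fit_all <- lm(y ~ x + monday + tuesday + wednesday + thursday +
                friday + saturday + sunday, data = sim)

# The seven dummies add up to the intercept column, so one coefficient
# cannot be estimated and lm() reports it as NA.
aliased <- names(which(is.na(coef(fit_all))))
print(aliased)
```

Passing day as a factor avoids this by construction, because R builds the contrasts itself and leaves out the reference level.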