
Linear regression [R]: how to calculate multiple coefficients for the same predictor based on the occurrence of a categorical variable

I have a forecasting problem that I am tackling with linear regression, and in this problem the day of the week matters. At the moment I use:

lm.mod <- lm(y ~ x + monday + tuesday + thursday + friday + saturday + sunday, data=train)

Here y and x are continuous variables and the days of the week are dummy variables (each is either 0 or 1). This way the dependence on the weekday is captured by different intercepts (the coefficients on the dummies). However, I would like to estimate a different coefficient for x for each day of the week.

I can do this with a GAM (library: mgcv) through the by argument of the spline function s(), where "day" is a categorical variable containing the name of the day of the week:

gam.mod <- mgcv::gam(y ~ s(x, bs='cs', by=day) + monday + tuesday + thursday + friday + saturday + sunday, data = train, method="REML", select=TRUE)

Here are a few lines of the data frame train:

Date        | y          | x          | day       | Monday | Tuesday | Wednesday |
---------------------------------------------------------------------------------
2013-01-01  | 0.87604858 | 0.07339450 | Tuesday   | 0      | 1       | 0         |
2013-01-02  | 0.90190414 | 0.16513761 | Wednesday | 0      | 0       | 1         |

With mgcv I obtain a different spline for each day of the week (each value of the factor variable "day"). With a linear model I would like to obtain as many coefficients for x as there are levels of the factor variable. Is it possible? Is there any workaround?

Maybe I'm missing something, but it appears to me that you are asking for the interaction between x and the weekdays?

I.e., simplified a bit, something like this:

# Toy data
n <- 100
train <- data.frame(replicate(5, rnorm(n)))
names(train) <- c("x", "y", "mon", "tue", "wed")

lm.mod <- lm(y ~ x*(mon + tue + wed), data=train)
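To see what the x*(mon + tue + wed) shorthand expands to, one can inspect the formula's term labels without fitting anything. This is only a small sketch using base R's terms(), with the toy variable names from above:

```r
# Expand the formula into its individual model terms (no data needed)
f <- y ~ x * (mon + tue + wed)
labels <- attr(terms(f), "term.labels")
print(labels)
# "x" "mon" "tue" "wed" "x:mon" "x:tue" "x:wed"
```

The x:mon, x:tue and x:wed terms are the ones that let the slope of x change when the corresponding dummy is 1.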

You want to avoid creating the binary terms yourself. In fact, since the mgcv notation already implies a separate spline for each day, you want to include day as a factor in the model, not all those separate dummy terms.

So, the gam model would be:

gam(y ~ s(x, bs='cs', by=day) + day, data = train, method="REML", select=TRUE)

where day is a factor with levels c('Monday','Tuesday', ....).

Then the linear model becomes:

lm(y ~ x * day, data = train)
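As a rough illustration of how the per-day slopes can be read off this model, here is a sketch on simulated data. The data frame, the true slopes (1 to 7) and the noise level are all made up for the example; only the formula y ~ x * day comes from the answer. With R's default treatment contrasts, the coefficient named x is the slope for the reference level and each x:day<level> coefficient is a difference from it.

```r
# Hypothetical toy data: the true slope of x is 1 on Monday, ..., 7 on Sunday
set.seed(1)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
n <- 700
train <- data.frame(
  x   = runif(n),
  day = factor(rep(days, length.out = n), levels = days)
)
train$y <- 0.5 + as.integer(train$day) * train$x + rnorm(n, sd = 0.1)

lm.mod <- lm(y ~ x * day, data = train)
cf <- coef(lm.mod)

# Reference-level slope (Monday) plus the per-day differences
slopes <- c(cf[["x"]], cf[["x"]] + unname(cf[paste0("x:day", days[-1])]))
names(slopes) <- days
print(round(slopes, 2))
```

The recovered slopes should sit close to the simulated values, one per level of day.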

You have to work a little harder to get the estimated means for each day. For the gam() model, use predict() with newdata containing one row per day and type = 'terms'; you can then add the intercept to the day contribution (effect). For the lm() model you can most easily do this using the multcomp package.
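For the lm() case, a base-R alternative to multcomp (not what the answer itself uses) is to call predict() on a small newdata frame with one row per day. The sketch below assumes simulated data and evaluates every day at the average x, which is a choice made purely for illustration:

```r
# Hypothetical toy data with a different slope of x per weekday
set.seed(3)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
n <- 700
train <- data.frame(
  x   = runif(n),
  day = factor(rep(days, length.out = n), levels = days)
)
train$y <- 0.5 + as.integer(train$day) * train$x + rnorm(n, sd = 0.1)

lm.mod <- lm(y ~ x * day, data = train)

# One row per day, all evaluated at the same x (here: the sample mean)
newdata <- data.frame(x   = mean(train$x),
                      day = factor(days, levels = days))
est <- predict(lm.mod, newdata = newdata, interval = "confidence")
rownames(est) <- days
print(round(est, 3))   # columns: fit, lwr, upr
```

Each row is the estimated mean of y for that day at the chosen x, with a 95% confidence interval.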

You could also just drop the intercept (add + 0 to the model formula). There are other ways to parameterise the model so that it directly gives you the estimates you may want.
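One such parameterisation, sketched here on simulated data, drops the intercept and the main effect of x so that the fit reports one intercept and one slope per day directly. The formula y ~ 0 + day + x:day is one possible choice among several, and the data are hypothetical:

```r
# Hypothetical toy data: intercept 0.5 every day, slope 1..7 by weekday
set.seed(4)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
n <- 700
train <- data.frame(
  x   = runif(n),
  day = factor(rep(days, length.out = n), levels = days)
)
train$y <- 0.5 + as.integer(train$day) * train$x + rnorm(n, sd = 0.1)

# No global intercept, no main effect of x: separate line per day
lm.sep <- lm(y ~ 0 + day + x:day, data = train)
cf <- coef(lm.sep)

# Interaction coefficients contain ":"; the rest are per-day intercepts
is_slope   <- grepl(":", names(cf), fixed = TRUE)
intercepts <- cf[!is_slope]
slopes     <- cf[is_slope]
print(round(cbind(intercept = unname(intercepts),
                  slope     = unname(slopes)), 2))

# Same fitted values as y ~ x * day, only the coefficients are arranged differently
lm.int <- lm(y ~ x * day, data = train)
```

This is the same model as y ~ x * day in terms of fit; only the meaning of the reported coefficients changes.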

That your models fit at all is because R is internally dropping some effects; you can't fit an intercept and all those day terms because one of the separate day variables is linearly dependent on the intercept together with the other day variables, and thus cannot be uniquely identified.
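The dropped effect is easy to see on a small example. The sketch below builds seven 0/1 weekday columns by hand (lowercase names and random data are assumptions made for the illustration) and fits them together with an intercept; lm() then reports NA for the coefficient it could not identify:

```r
# Hypothetical data with one 0/1 column per weekday
set.seed(2)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
n <- 140
day <- factor(rep(days, length.out = n), levels = days)
train <- data.frame(x = runif(n), y = rnorm(n))
for (d in days) train[[tolower(d)]] <- as.integer(day == d)

# Intercept plus all seven dummies: the dummies sum to the intercept column
full <- lm(y ~ x + monday + tuesday + wednesday + thursday +
             friday + saturday + sunday, data = train)
print(coef(full))                       # one coefficient is NA
print(c(rank = full$rank, terms = length(coef(full))))
```

The model matrix has nine columns but rank eight, so one day coefficient cannot be estimated; using a factor for day lets R choose a reference level instead.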


