Multiple Linear Regression with Dichotomous Predictor Variables in R: to dummy-code or let R handle it?

I am running a multiple linear regression for a course using R. One of the predictor variables I want to include in the model is the sex of the individual, coded as "m" and "f". I ran the model in R two different ways:

Model 1: "Sex" as the original categorical variable

lm(formula = P_iP_Choice ~ Sex + Carapace + Competitor_Presence_BI + 
    PSI_Day1_Choice + AGG_AVERAGE, data = pano2014)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.55241 -0.12879 -0.04414  0.13769  0.67394 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -0.43031    0.23872  -1.803 0.074353 .  
Sexm                   -0.28566    0.04685  -6.098 1.86e-08 ***
Carapace                0.15558    0.04534   3.431 0.000863 ***
Competitor_Presence_BI -0.03339    0.04532  -0.737 0.462870    
PSI_Day1_Choice         0.15825    0.13029   1.215 0.227273    
AGG_AVERAGE             0.15406    0.07790   1.978 0.050604 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2375 on 104 degrees of freedom
Multiple R-squared: 0.3146, Adjusted R-squared: 0.2817 
F-statistic: 9.549 on 5 and 104 DF,  p-value: 1.611e-07 

Model 2: Sex of the individuals as a separate variable "Female", coded 0 = male, 1 = female.

lm(formula = P_iP_Choice ~ Female + Carapace + Competitor_Presence_BI + 
    PSI_Day1_Choice + AGG_AVERAGE, data = pano2014)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.55241 -0.12879 -0.04414  0.13769  0.67394 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -0.71597    0.24498  -2.923 0.004260 ** 
Female                  0.28566    0.04685   6.098 1.86e-08 ***
Carapace                0.15558    0.04534   3.431 0.000863 ***
Competitor_Presence_BI -0.03339    0.04532  -0.737 0.462870    
PSI_Day1_Choice         0.15825    0.13029   1.215 0.227273    
AGG_AVERAGE             0.15406    0.07790   1.978 0.050604 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2375 on 104 degrees of freedom
Multiple R-squared: 0.3146, Adjusted R-squared: 0.2817 
F-statistic: 9.549 on 5 and 104 DF,  p-value: 1.611e-07
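
(For reference, a 0/1 indicator like "Female" can be derived from the Sex factor along these lines; my exact recoding step is not shown above, so treat this as a sketch.)

# Sketch: building the 0/1 "Female" indicator from the Sex variable,
# assuming Sex is stored as "m"/"f" in pano2014
pano2014$Female <- as.numeric(pano2014$Sex == "f")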

My understanding is that the difference in the intercept coefficient arises because in Model 1 R dummy-coded my categorical variable into a dichotomous one: R created a dummy for males ("Sexm"), so the variation in my response variable explained by females is absorbed into the intercept. However, in Model 2 this change in how the model is run does not alter the coefficient estimates for the other parts of my model.

What I would like to know is: what is the "correct" or widely accepted way of handling dichotomous categorical variables in linear models? Dummy-coding them yourself, or letting R dummy-code them?

Either way is correct (assuming you do the manual coding properly), but there is a but. R supports several coding schemes (contrasts) for categorical variables: dummy coding, deviation coding, Helmert coding, and so on. What changes between these schemes is the meaning of the intercept and the interpretation of the parameters. For instance, with dummy coding you compare all categories against a single base category, and the intercept is the mean for the base category (with all other parameters at zero). With deviation coding, your intercept is the grand mean, and your parameters are deviations from that grand mean. For example, if you are conducting a country-level analysis, it is not always useful to compare every country against, say, France. Instead, you might want to compare each country to some mean, say, the mean for the European Union.
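
As a rough sketch (reusing the model from the question, so the variable names are taken from the output above), switching between these coding schemes in R is just a matter of telling lm which contrasts to use:

# Default: treatment (dummy) coding -- the intercept is the mean of the base category
m_dummy <- lm(P_iP_Choice ~ Sex + Carapace + Competitor_Presence_BI +
                PSI_Day1_Choice + AGG_AVERAGE, data = pano2014)

# Deviation (sum) coding -- the intercept now reflects the average over the sexes
# rather than the base category, and the Sex coefficient is a deviation from it
m_dev <- lm(P_iP_Choice ~ Sex + Carapace + Competitor_Presence_BI +
              PSI_Day1_Choice + AGG_AVERAGE, data = pano2014,
            contrasts = list(Sex = contr.sum))

# Helmert coding works the same way: contrasts = list(Sex = contr.helmert)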

This also goes for dichotomous variables. Do you want to compare men to women, or would you rather compare men to the grand mean and women to the grand mean? Both are feasible, depending on your research context.

Now, when you use manual coding, you make no error. Yet you cannot quickly switch from one coding system to another; you would have to recode everything manually again. For more complex coding systems there is a real chance of making a mistake when doing it by hand. This may not matter much for dichotomous variables, but if you have more categories, creating dummies manually will clutter up your dataset and may cause confusion when you return to the analysis in a few months. These are just a few arguments for using the automatic coding.
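
To give a concrete flavour of that last point (a sketch assuming the data from the question), letting R do the coding means the data never need to be touched when you change your mind:

# Change which sex is the baseline without recoding anything
pano2014$Sex <- relevel(factor(pano2014$Sex), ref = "m")

# Or attach a different coding scheme directly to the factor
contrasts(pano2014$Sex) <- contr.sum(2)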

You can find additional information on coding systems in R here. It is a useful read and gives you more flexibility within the context of regression. Good luck!

Just to expand a bit on @BenBolker's comment.

In your first model, R takes Sex=F as the baseline and reports an intercept of -0.43031. If Sex=M, the whole model is shifted by -0.28566 (the coefficient of Sexm). So Sexm is not the impact of males; it is the difference between the models for Sex=F and Sex=M. None of the other parameters is affected, because this is a linear model with no interactions. So when Sex=M you have an identical model, but with the intercept being -0.43031 + (-0.28566) = -0.71597.

In your second model, Female is a numeric predictor. The intercept applies when Female=0 (i.e., Sex=M) and, at -0.71597, is equivalent to the first model. Again, none of the other parameters differs, because this is a linear model with no interactions.
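
A quick way to convince yourself that the two fits are the same model (a sketch based on the calls shown in the question):

m1 <- lm(P_iP_Choice ~ Sex + Carapace + Competitor_Presence_BI +
           PSI_Day1_Choice + AGG_AVERAGE, data = pano2014)
m2 <- lm(P_iP_Choice ~ Female + Carapace + Competitor_Presence_BI +
           PSI_Day1_Choice + AGG_AVERAGE, data = pano2014)

# Identical fitted values: the two parameterisations describe the same model
all.equal(fitted(m1), fitted(m2))

# Model 1's intercept (females) plus the Sexm shift recovers Model 2's intercept (males)
unname(coef(m1)["(Intercept)"] + coef(m1)["Sexm"])   # -0.43031 + (-0.28566) = -0.71597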

IMO the "correct" way depends on your audience. The idiomatic way to deal with categorical variables is the first - make it a factor. However, I have found that with non-technical, or "less-technical", audiences the second way is much easier to explain and understand. Note of course that this applies to dichotomous variables only - if your categorical variable can take on more than two values, you must use factors.
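
If you ever want to see exactly what R builds behind the scenes for a factor (two levels or more), model.matrix() shows the dummy columns it creates; a minimal, made-up illustration:

# Hypothetical three-level factor, just to show the expansion
site <- factor(c("A", "B", "C"))
model.matrix(~ site)
# Produces an intercept column plus two dummy columns, siteB and siteC;
# level "A" is absorbed into the intercept under the default treatment coding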
