
Linear model in R doesn't fit properly

I know that the title doesn't specify exactly what I mean, so let me explain it here. I am working on a dataset that consists of the yield of wheat for a given wheat type (A, B, C, D). My issue when fitting a linear model is that I'm trying to fit:

lm1 = lm(yield ~ type). When I do so, R takes the first wheat type (A) as the baseline, reports it as the intercept, and then estimates the influence of each of the other types on the yield relative to it. I know that I can fit a linear model like this: lm2 = lm(yield ~ 0 + type), which gives me an estimate for each type, but what I really want to see is a sort of combination of the two.

Is there an option to fit a linear model in R such that lm3 = lm(yield ~ GlobalIntercept + type), where GlobalIntercept would represent the general intercept of my linear model, so that I could then see the influence of each type of wheat on that general intercept? So it would be kind of like the first model, though this time we'd estimate the influence of all types of wheat (A, B, C, D) on the general yield.

Questions to SO should include minimal reproducible example data; see the instructions at the top of the tag page. Since the question did not include this, we will provide it this time by using the built-in InsectSprays data set that comes with R.

Here are a few approaches:

1) lm/contr.sum/dummy.coef Try using contr.sum sum-to-zero contrasts for the spray factor and look at the dummy coefficients. That will expand the coefficients to include all 6 levels of the spray factor in this example:

fm <- lm(count ~ spray, InsectSprays, contrasts = list(spray = contr.sum))
dummy.coef(fm)
## Full coefficients are 
##                                                                           
## (Intercept):          9.5                                                  
## spray:                  A         B         C         D         E         F
##                  5.000000  5.833333 -7.416667 -4.583333 -6.000000  7.166667

sum(dummy.coef(fm)$spray)  # check that coefs sum to zero
## [1] 0
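
As a quick sanity check on (1), the sketch below adds the intercept to each level's effect and compares the result with the per-level means. The names dc, rebuilt and group_means are introduced here only for illustration.

```r
# Refit the sum-to-zero model from approach (1) on the built-in data.
fm <- lm(count ~ spray, InsectSprays, contrasts = list(spray = contr.sum))
dc <- dummy.coef(fm)

# Intercept plus each level's effect should reproduce that level's mean.
rebuilt <- unname(dc[["(Intercept)"]]) + dc$spray
group_means <- with(InsectSprays, tapply(count, spray, mean))

# Compare the values only (names and dims are stripped first).
all.equal(unname(rebuilt), as.vector(group_means))
```

If the comparison holds, the intercept can be read as a general level and each spray coefficient as that level's deviation from it, which is the interpretation the question asks for.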

2) tapply If each level has the same number of rows in the data set, as is the case with InsectSprays where each level has 12 rows, then we can take the mean for each level and subtract the intercept (which is the overall mean). This does not work if the data set is unbalanced, i.e. if the different levels have different numbers of rows. Note how the calculations below give the same result as (1).

mean(InsectSprays$count)  # intercept
## [1] 9.5

with(InsectSprays, tapply(count, spray, mean) - mean(count))
##         A         B         C         D         E         F 
##  5.000000  5.833333 -7.416667 -4.583333 -6.000000  7.166667 
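
To illustrate the caveat about unbalanced data, here is a minimal sketch that drops six rows of spray A (a made-up subset, chosen only to unbalance the design) and refits the contr.sum model. In a one-way layout with sum-to-zero contrasts, the intercept is the unweighted mean of the level means rather than the mean over all rows, so subtracting the overall row mean no longer matches (1).

```r
# Hypothetical unbalanced subset: drop the first 6 rows (all spray A).
unbal <- InsectSprays[-(1:6), ]

group_means <- with(unbal, tapply(count, spray, mean))

fm_u <- lm(count ~ spray, unbal, contrasts = list(spray = contr.sum))
dc <- dummy.coef(fm_u)

# The intercept matches the unweighted mean of the level means,
# not the overall mean of the rows.
c(intercept = unname(dc[["(Intercept)"]]),
  mean_of_level_means = mean(group_means),
  overall_row_mean = mean(unbal$count))

# Effects equal level means minus the mean of the level means.
all.equal(unname(dc$spray), as.vector(group_means - mean(group_means)))
```

So for unbalanced data, subtracting mean(group_means) instead of the overall row mean should still agree with the contr.sum coefficients.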

3) aov/model.tables We can also use aov with model.tables like this:

fm2 <- aov(count ~ spray, InsectSprays)
model.tables(fm2)
## Tables of effects
##
##  spray 
## spray
##      A      B      C      D      E      F 
##  5.000  5.833 -7.417 -4.583 -6.000  7.167 

model.tables(fm2, type = "means")
## Tables of means
## Grand mean
##    
## 9.5 
##
##  spray 
## spray
##      A      B      C      D      E      F 
## 14.500 15.333  2.083  4.917  3.500 16.667 

4) emmeans We can use lm followed by emmeans like this:

library(emmeans)

fm <- lm(count ~ spray, InsectSprays)
emmeans(fm, "spray")
##  spray emmean   SE df lower.CL upper.CL
##  A      14.50 1.13 66   12.240    16.76
##  B      15.33 1.13 66   13.073    17.59
##  C       2.08 1.13 66   -0.177     4.34
##  D       4.92 1.13 66    2.656     7.18
##  E       3.50 1.13 66    1.240     5.76
##  F      16.67 1.13 66   14.406    18.93
##
## Confidence level used: 0.95 
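
The emmeans table above shows level means rather than deviations. One way to get the deviations, sketched here on the assumption that the emmeans package is installed, is the "eff" contrast method, which compares each level with the average of all level means. The p-values it prints are adjusted for multiplicity by default.

```r
library(emmeans)

fm <- lm(count ~ spray, InsectSprays)
emm <- emmeans(fm, "spray")

# "eff" contrasts: each level's mean minus the average of all level means.
eff <- contrast(emm, method = "eff")
eff

# Extract just the estimated effects as a numeric vector.
est <- summary(eff)$estimate
```

Unlike dummy.coef, this also reports a standard error for each effect.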

From the information you provided, I infer that you are modeling the yield as a linear function of type, which has four categories. Your expectation is to have an intercept in addition to a coefficient for each of the types. This doesn't make sense.

You are predicting the yield from a nominal variable. If you want a regression with an intercept, the predictor variable needs to have an origin. The origin is the zero value of the predictor, and a nominal variable does not have one. In other words, with a continuous predictor the intercept is the value of the dependent variable y when the predictor is zero; in your case that would mean a type category of zero, which is not possible. That is why your model takes one of the categories as the reference category and calculates the intercept for it. The coefficients then give the change in y when the category differs from the reference category.
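
The reference-category behaviour can be checked with a minimal sketch. It uses the built-in InsectSprays data in place of the wheat data, which was not posted, and it assumes R's default treatment contrasts are in effect. The names fm_trt and group_means are illustrative.

```r
# Default fit: the first level (A) acts as the reference category.
fm_trt <- lm(count ~ spray, InsectSprays)

# Per-level means as a plain named vector.
group_means <- sapply(split(InsectSprays$count, InsectSprays$spray), mean)

# The intercept equals the mean of the reference level A.
coef(fm_trt)[["(Intercept)"]]
group_means[["A"]]

# Each remaining coefficient is that level's mean minus the mean of A.
all.equal(unname(coef(fm_trt)[-1]),
          unname(group_means[-1] - group_means[["A"]]))
```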
