How to fit a known linear equation to my data in R?

I used a linear model, via the lm() function, to obtain the best fit to my data. From the literature I know that the optimal fit would be a linear regression with slope = 1 and intercept = 0. I would like to see how well this equation (y = x) fits my data. How do I proceed in order to find an R^2 as well as a p-value?

This is my data (y = modelled, x = measured):

measured<-c(67.39369,28.73695,60.18499,49.32405,166.39318,222.29022,271.83573,241.72247, 368.46304,220.27018,169.92343,56.49579,38.18381,49.33753,130.91752,161.63536,294.14740,363.91029,358.32905,239.84112,129.65078,32.76462,30.13952,52.83656,67.35427,132.23034,366.87857,247.40125,273.19316,278.27902,123.24256,45.98363,83.50199,240.99459,266.95707,308.69814,228.34256,220.51319,83.97942,58.32171,57.93815,94.64370,264.78007,274.25863,245.72940,155.41777,77.45236,70.44223,104.22838,294.01645,312.42321,122.80831,41.65770,242.22661,300.07147,291.59902,230.54478,89.42498,55.81760,55.60525,111.64263,305.76432,264.27192,233.28214,192.75603,75.60803,63.75376)

modelled<-c(42.58318,71.64667,111.08853,67.06974,156.47303,240.41188,238.25893,196.42247,404.28974,138.73164,116.73998,55.21672,82.71556,64.27752,145.84891,133.67465,295.01014,335.25432,253.01847,166.69241,68.84971,26.03600,45.04720,75.56405,109.55975,202.57084,288.52887,140.58476,152.20510,153.99427,75.70720,92.56287,144.93923,335.90871,NA,264.25732,141.93407,122.80440,83.23812,42.18676,107.97732,123.96824,270.52620,388.93979,308.35117,100.79047,127.70644,91.23133,162.53323,NA ,276.46554,100.79440,81.10756,272.17680,387.28700,208.29715,152.91548,62.54459,31.98732,74.26625,115.50051,324.91248,210.14204,168.29598,157.30373,45.76027,76.07370)

Now I would like to see how well the equation y = x fits the data presented above (R^2 and p-value).

I would be very grateful if somebody could help me with this (basic) problem, as I found no answers to my question on Stack Overflow.

Best regards, Cyril

Let's be clear about what you are asking here. You have an existing model, which is "the modelled values are the expected value of the measured values", or in other words, measured = modelled + e, where e are the normally distributed residuals.

You say that the "optimal fit" should be a straight line with intercept 0 and slope 1, which is another way of saying the same thing.

The thing is, this "optimal fit" is not the optimal fit for your actual data, as we can easily see by doing:

summary(lm(measured ~ modelled))
#> 
#> Call:
#> lm(formula = measured ~ modelled)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -103.328  -39.130   -4.881   40.428  114.829 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 23.09461   13.11026   1.762    0.083 .  
#> modelled     0.91143    0.07052  12.924   <2e-16 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 55.13 on 63 degrees of freedom
#> Multiple R-squared:  0.7261, Adjusted R-squared:  0.7218 
#> F-statistic:   167 on 1 and 63 DF,  p-value: < 2.2e-16

This shows us the line that produces the optimal fit to your data in the sense of minimizing the sum of the squared residuals.

But I guess what you are asking is "How well do my data fit the model measured = modelled + e?"

Trying to coerce lm into giving you a fixed intercept and slope probably isn't the best way to answer this question. Remember, the p value for the slope only tells you whether the actual slope is significantly different from 0. The above model already confirms that. If you want to know the r-squared of measured = modelled + e, you just need to know the proportion of the variance of measured that is explained by modelled. In other words:

1 - var(measured - modelled) / var(measured)
#> [1] 0.7192672

This is pretty close to the r-squared from the lm call.
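One practical point about the variance-based expression: the vectors in the question contain NA values, and var() returns NA unless incomplete pairs are dropped first. The sketch below uses small hypothetical vectors (obs and pred are not the question's data) to show one way to handle this. It also illustrates that var() centres the differences, so a constant offset between the two vectors does not lower this figure, whereas it does lower 1 - RSS/TSS.

```r
# Hypothetical toy data with missing values (not the question's data)
obs  <- c(10, 20, NA, 40, 50)
pred <- c(12, 19, 33, NA, 53)

# Without handling NA, the variance of the differences is NA
naive <- var(obs - pred)

# Keep only the pairs where both values are present
ok <- complete.cases(obs, pred)
res <- obs[ok] - pred[ok]

# Variance-based version: centres the residuals, so it ignores bias
r2_var <- 1 - var(res) / var(obs[ok])

# Sum-of-squares version: penalises bias as well as scatter
r2_rss <- 1 - sum(res^2) / sum((obs[ok] - mean(obs[ok]))^2)

print(c(naive = naive, r2_var = r2_var, r2_rss = r2_rss))
```

The two figures coincide only when the residuals average exactly zero; otherwise the variance-based one is the larger of the two.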

I think you have sufficient evidence to say that your data are consistent with the model measured = modelled, in that the 95% confidence interval for the slope in the lm model includes the value 1, and the 95% confidence interval for the intercept includes the value 0.
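The confidence intervals mentioned here can be obtained with confint(). As a possible complement, which none of the answers ran, a standard nested-model F test compares the residual sum of squares of the fixed line y = x with that of the freely fitted line, giving a single p value for the joint hypothesis "intercept = 0 and slope = 1". The sketch below uses simulated data; x, y and the seed are assumptions for illustration only.

```r
# Hypothetical example data: y scatters around the line y = x
set.seed(1)
x <- runif(50, min = 20, max = 400)
y <- x + rnorm(50, sd = 50)

fit <- lm(y ~ x)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
print(ci)

# Joint F test of H0: intercept = 0 and slope = 1
n <- length(y)
rss_free  <- sum(resid(fit)^2)   # intercept and slope estimated
rss_fixed <- sum((y - x)^2)      # line fixed at y = x, nothing estimated
f_stat  <- ((rss_fixed - rss_free) / 2) / (rss_free / (n - 2))
p_joint <- pf(f_stat, df1 = 2, df2 = n - 2, lower.tail = FALSE)
print(c(F = f_stat, p = p_joint))
```

With real data containing NA values, drop the incomplete pairs before computing rss_fixed so that both sums of squares use the same observations. A large p value here means the data do not contradict y = x; it does not prove that y = x holds.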

As mentioned in the comments, you can use the lm() function, but this actually estimates the slope and intercept for you, whereas what you want is something different.

If slope = 1 and intercept = 0, you essentially already have a fit, and your modelled values are the predicted values. You need the R-squared from this fit. R-squared is defined as:

R^2 = MSS/TSS = (TSS − RSS)/TSS

See this link for the definitions of RSS and TSS.

We can only work with observations that are complete (non-NA). So we calculate each of them:

# keep only the pairs where both values are present
nonNA = !is.na(modelled) & !is.na(measured)
# residual sum of squares of your prediction (y = x)
RSS = sum((modelled[nonNA] - measured[nonNA])^2, na.rm = TRUE)
# total sum of squares of the data
TSS = sum((measured[nonNA] - mean(measured[nonNA]))^2, na.rm = TRUE)

1 - RSS/TSS
[1] 0.7116585
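The calculation above can be wrapped in a small helper. This is only a sketch: the function name r2_fixed_line and the toy vectors are hypothetical. It also shows a property worth keeping in mind: because the line y = x is not fitted to the data, this quantity can be negative when the predictions are worse than simply using the mean of the observations.

```r
# Hypothetical helper: R^2 of the fixed line y = x, using complete pairs only
r2_fixed_line <- function(obs, pred) {
  ok  <- !is.na(obs) & !is.na(pred)
  rss <- sum((obs[ok] - pred[ok])^2)          # residuals from the prediction
  tss <- sum((obs[ok] - mean(obs[ok]))^2)     # spread of the observations
  1 - rss / tss
}

obs <- c(1, 2, 3, 4, 5)

# Predictions close to the observations
pred_good <- obs + c(0.1, -0.1, 0.2, -0.2, 0)
r2_good <- r2_fixed_line(obs, pred_good)

# Predictions far from the observations
pred_bad <- rep(10, 5)
r2_bad <- r2_fixed_line(obs, pred_bad)

print(c(good = r2_good, bad = r2_bad))
```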

If measured and modelled are supposed to represent the actual and fitted values of an undisclosed model, as discussed in the comments below another answer, then if fm is the lm object for that undisclosed model,

summary(fm)

will show the R^2 and p value of that model.

The R-squared value can actually be calculated using only measured and modelled, but the formula differs depending on whether or not there is an intercept in the undisclosed model. The signs are that there is no intercept, since if there were an intercept then sum(modelled - measured, na.rm = TRUE) should be 0, but in fact it is far from it.
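The intercept check described here can be illustrated with the built-in CO2 data frame. In the sketch below (the object names fm_int and fm_noint are made up for this example), the residuals of a least-squares fit that includes an intercept sum to zero up to floating-point error, while those of a fit without an intercept generally do not.

```r
# Model with an intercept: residuals sum to (numerically) zero
fm_int <- lm(uptake ~ conc, data = CO2)
sum_int <- sum(resid(fm_int))

# Model forced through the origin: residuals need not sum to zero
fm_noint <- lm(uptake ~ 0 + conc, data = CO2)
sum_noint <- sum(resid(fm_noint))

print(c(with_intercept = sum_int, without_intercept = sum_noint))
```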

In any case, R^2 and the p value are shown in the output of summary(fm), where fm is the undisclosed linear model, so there is no point in restricting the discussion to measured and modelled if you have the lm object of the undisclosed model.

For example, if the undisclosed model is the following, using the built-in CO2 data frame:

fm <- lm(uptake ~ Type + conc, CO2)
summary(fm)

we have this output, where the last two lines show R-squared and the p value.

Call:
lm(formula = uptake ~ Type + conc, data = CO2)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.2145  -4.2549   0.5479   5.3048  12.9968 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      25.830052   1.579918  16.349  < 2e-16 ***
TypeMississippi -12.659524   1.544261  -8.198 3.06e-12 ***
conc              0.017731   0.002625   6.755 2.00e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.077 on 81 degrees of freedom
Multiple R-squared:  0.5821,    Adjusted R-squared:  0.5718 
F-statistic: 56.42 on 2 and 81 DF,  p-value: 4.498e-16
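If the two numbers are needed as values rather than printed text, they can be pulled out of the summary object. A short sketch, assuming the same CO2 example as above: the overall p value is not stored directly, so it is recomputed from the stored F statistic and its degrees of freedom.

```r
fm <- lm(uptake ~ Type + conc, CO2)
s <- summary(fm)

# R squared is stored directly in the summary object
r2 <- s$r.squared

# fstatistic is a named vector: value, numdf, dendf
f <- s$fstatistic
p_value <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(c(r.squared = r2, p.value = p_value))
```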
