
How to fit a known linear equation to my data in R?

I used a linear model (the lm() function) to obtain the best fit to my data. From the literature I know that the optimal fit would be a linear regression with slope = 1 and intercept = 0. I would like to see how well this equation (y = x) fits my data. How do I proceed in order to find an R^2 as well as a p-value?

This is my data (y = modelled, x = measured)

measured<-c(67.39369,28.73695,60.18499,49.32405,166.39318,222.29022,271.83573,241.72247, 368.46304,220.27018,169.92343,56.49579,38.18381,49.33753,130.91752,161.63536,294.14740,363.91029,358.32905,239.84112,129.65078,32.76462,30.13952,52.83656,67.35427,132.23034,366.87857,247.40125,273.19316,278.27902,123.24256,45.98363,83.50199,240.99459,266.95707,308.69814,228.34256,220.51319,83.97942,58.32171,57.93815,94.64370,264.78007,274.25863,245.72940,155.41777,77.45236,70.44223,104.22838,294.01645,312.42321,122.80831,41.65770,242.22661,300.07147,291.59902,230.54478,89.42498,55.81760,55.60525,111.64263,305.76432,264.27192,233.28214,192.75603,75.60803,63.75376)

modelled<-c(42.58318,71.64667,111.08853,67.06974,156.47303,240.41188,238.25893,196.42247,404.28974,138.73164,116.73998,55.21672,82.71556,64.27752,145.84891,133.67465,295.01014,335.25432,253.01847,166.69241,68.84971,26.03600,45.04720,75.56405,109.55975,202.57084,288.52887,140.58476,152.20510,153.99427,75.70720,92.56287,144.93923,335.90871,NA,264.25732,141.93407,122.80440,83.23812,42.18676,107.97732,123.96824,270.52620,388.93979,308.35117,100.79047,127.70644,91.23133,162.53323,NA ,276.46554,100.79440,81.10756,272.17680,387.28700,208.29715,152.91548,62.54459,31.98732,74.26625,115.50051,324.91248,210.14204,168.29598,157.30373,45.76027,76.07370)

Now I would like to see how well the equation y = x fits the data presented above (R^2 and p-value).

I would be very grateful if somebody could help me with this (basic) problem, as I found no answers to my question on Stack Overflow.

Best regards Cyril

Let's be clear what you are asking here. You have an existing model, which is "the modelled values are the expected value of the measured values", or in other words, measured = modelled + e, where e are the normally distributed residuals.

You say that the "optimal fit" should be a straight line with intercept 0 and slope 1, which is another way of saying the same thing.

The thing is, this "optimal fit" is not the optimal fit for your actual data, as we can easily see by doing:

summary(lm(measured ~ modelled))
#> 
#> Call:
#> lm(formula = measured ~ modelled)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -103.328  -39.130   -4.881   40.428  114.829 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 23.09461   13.11026   1.762    0.083 .  
#> modelled     0.91143    0.07052  12.924   <2e-16 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 55.13 on 63 degrees of freedom
#> Multiple R-squared:  0.7261, Adjusted R-squared:  0.7218 
#> F-statistic:   167 on 1 and 63 DF,  p-value: < 2.2e-16

This shows us the line that would produce the optimal fit to your data in terms of reducing the sum of the squared residuals.

But I guess what you are asking is "How well do my data fit the model measured = modelled + e?"

Trying to coerce lm into giving you a fixed intercept and slope probably isn't the best way to answer this question. Remember, the p value for the slope only tells you whether the actual slope is significantly different from 0. The above model already confirms that. If you want to know the R squared of measured = modelled + e, you just need to know the proportion of the variance of measured that is explained by modelled. In other words:

1 - var(measured - modelled, na.rm = TRUE) / var(measured)
#> [1] 0.7192672

This is pretty close to the r squared from the lm call.

I think you have sufficient evidence to say that your data are consistent with the model measured = modelled, in that the 95% confidence interval for the slope in the lm model includes the value 1, and the 95% confidence interval for the intercept includes the value 0.
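If you want to see those intervals explicitly, base R's confint() prints them for the fitted object (a quick illustrative check, not part of the answer above):

fit <- lm(measured ~ modelled)  # rows where modelled is NA are dropped automatically
confint(fit)                    # 95% confidence intervals by default
# the "(Intercept)" row should include 0 and the "modelled" row should include 1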

As mentioned in the comments, you can use the lm() function, but this actually estimates the slope and intercept for you, whereas what you want is something different.

If slope = 1 and the intercept = 0, essentially you already have a fit and your modelled values are the predicted values. You need the R squared from this fit. R squared is defined as:

R^2 = MSS/TSS = (TSS − RSS)/TSS

Here RSS is the residual sum of squares and TSS is the total sum of squares.

We can only work with observations that are complete (non-NA), so we calculate RSS and TSS on the complete cases:

# observations that are complete (non-NA) in both vectors
nonNA = !is.na(modelled) & !is.na(measured)
# residual sum of squares from your prediction (modelled is already the fitted value)
RSS = sum((modelled[nonNA] - measured[nonNA])^2)
# total sum of squares of the data
TSS = sum((measured[nonNA] - mean(measured[nonNA]))^2)

1 - RSS/TSS
[1] 0.7116585
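If you also want a p value for the fixed y = x fit (the question asks for both), one common approach is a joint F test of intercept = 0 and slope = 1, comparing the residual sum of squares of the forced fit with that of the freely estimated line. A minimal sketch (not from the original answers, just the standard restricted-vs-unrestricted F test):

nonNA    <- !is.na(modelled) & !is.na(measured)
fit_free <- lm(measured ~ modelled, subset = nonNA)     # freely estimated line
rss_free <- sum(residuals(fit_free)^2)                  # RSS of the estimated fit
rss_fix  <- sum((measured[nonNA] - modelled[nonNA])^2)  # RSS when forcing y = x
n        <- sum(nonNA)
Fstat    <- ((rss_fix - rss_free) / 2) / (rss_free / (n - 2))  # 2 restrictions
pf(Fstat, 2, n - 2, lower.tail = FALSE)                 # p value of the joint test

A large p value means the data are consistent with y = x; a small one means the forced line fits significantly worse than the freely estimated one.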

If measured and modelled are supposed to represent the actual and fitted values of an undisclosed model, as discussed in the comments below another answer, and fm is the lm object for that undisclosed model, then

summary(fm)

will show the R^2 and p value of that model.

The R squared value can actually be calculated using only measured and modelled, but the formula is different depending on whether or not there is an intercept in the undisclosed model. The signs are that there is no intercept, since if there were an intercept sum(modelled - measured, na.rm = TRUE) would be 0, but in fact it is far from it.
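As a sketch of that check and of the two formulas (assuming, as above, that measured is the response and modelled holds the fitted values of the undisclosed model):

ok  <- !is.na(measured) & !is.na(modelled)
rss <- sum((measured[ok] - modelled[ok])^2)
sum(measured[ok] - modelled[ok])                      # far from 0, suggesting no intercept
1 - rss / sum((measured[ok] - mean(measured[ok]))^2)  # R^2 if the model has an intercept
1 - rss / sum(measured[ok]^2)                         # R^2 if it does not (uncentred TSS)

The uncentred version is what summary.lm reports for models fitted without an intercept.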

In any case R^2 and the p value are shown in the output of summary(fm), where fm is the undisclosed linear model, so there is no point in restricting the discussion to measured and modelled if you have the lm object of the undisclosed model.

For example, if the undisclosed model is the following, using the built-in CO2 data frame:

fm <- lm(uptake ~ Type + conc, CO2)
summary(fm)

we have this output, where the last two lines show the R squared and the p value.

Call:
lm(formula = uptake ~ Type + conc, data = CO2)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.2145  -4.2549   0.5479   5.3048  12.9968 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      25.830052   1.579918  16.349  < 2e-16 ***
TypeMississippi -12.659524   1.544261  -8.198 3.06e-12 ***
conc              0.017731   0.002625   6.755 2.00e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.077 on 81 degrees of freedom
Multiple R-squared:  0.5821,    Adjusted R-squared:  0.5718 
F-statistic: 56.42 on 2 and 81 DF,  p-value: 4.498e-16
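
If you would rather extract those two numbers programmatically than read them off the printout, they are stored in the summary object; the overall p value is not kept directly, but it can be recomputed from the F statistic (an illustrative sketch):

s <- summary(fm)
s$r.squared                                                 # Multiple R-squared
s$adj.r.squared                                             # Adjusted R-squared
f <- s$fstatistic                                           # c(value, numdf, dendf)
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)  # overall p value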
