What is the measure of how well data are centered on the prediction line in lm?

Question

I have two datasets that I plot using R's lm command. In the first plot below, the data are not centered on the red line, but in the second plot, on the right, they are.

[Figures: data1 (left), data2 (right)]

My questions are:

What is the measure of how well the data are centered on the line?
How do I extract that from the data structure?

The code I use to plot that data is simply:

 data <-read.table("myfile.txt")
 dat1x <- data$x1
 dat1y <- data$y1


 # plot left figure
 dat1_lm <- lm(dat1x ~ dat1y)
 plot(dat1x ~ dat1y)
 abline(coef(dat1_lm),col="red")
 dat1_lm.r2  <- summary(dat1_lm)$adj.r.squared;

 # repeat the same for right figure
 dat2x <- data$x2
 dat2y <- data$y2
 dat2_lm <- lm(dat2x ~ dat2y)
 plot(dat2x ~ dat2y)
 abline(coef(dat2_lm),col="red")
 dat2_lm.r2  <- summary(dat2_lm)$adj.r.squared;

Update: plot with RMSE score:

[Figure: the two plots annotated with their RMSE scores]

I am looking for a score that shows that the right figure is better than the left, based on how well the data are centered on the prediction line.

Answer 1

The R-squared gives the goodness of fit of the line, i.e. the proportion of the variation in the dataset that is explained by the linear model. Another way to read the R-squared is as how much better the model performs than the mean model. The p-value gives the significance of the fit, i.e. whether the coefficient of the linear model is significantly different from zero.

To extract these values:

dat = data.frame(a = runif(100), b = runif(100))
lm_obj = lm(a~b, dat)
rsq = summary(lm_obj)[["r.squared"]]
p_value = summary(lm_obj)[["coefficients"]]["b","Pr(>|t|)"]
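
As a rough check of the "better than the mean model" reading, R-squared can be recomputed by hand as one minus the ratio of the model's residual sum of squares to the sum of squares around the mean. The sketch below uses its own simulated data; the names dat_demo, fit, ss_res and ss_tot are illustrative and not taken from the question.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
dat_demo <- data.frame(b = runif(100))
dat_demo$a <- 2 * dat_demo$b + rnorm(100, sd = 0.3)  # simulated linear relation

fit <- lm(a ~ b, data = dat_demo)

ss_res <- sum(residuals(fit)^2)                   # squared error of the linear model
ss_tot <- sum((dat_demo$a - mean(dat_demo$a))^2)  # squared error of the mean-only model
rsq_manual <- 1 - ss_res / ss_tot

# Should agree with the value reported by summary()
all.equal(rsq_manual, summary(fit)$r.squared)
```

An R-squared near 0 means the line does little better than predicting the mean everywhere; a value near 1 means the points lie close to the line.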

Alternatively, you could calculate the RMSE between the observations and the outcome of the linear model:

rmse = sqrt(mean((dat$a - predict(lm_obj))^2))

Note that this is the RMSE between a and the fitted values of the linear model. If you want the RMSE between a and b directly:

rmse = sqrt(mean((dat$a - dat$b)^2))
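
The RMSE against the fitted line is simply the root mean square of the model residuals. It is close to, but not the same as, the "residual standard error" in summary(), which divides by the residual degrees of freedom instead of n. A minimal sketch on simulated data (dat_demo, rmse_fit and sigma_manual are made-up names):

```r
set.seed(2)  # arbitrary seed, only for reproducibility
dat_demo <- data.frame(b = runif(100))
dat_demo$a <- 1 + dat_demo$b + rnorm(100, sd = 0.2)
fit <- lm(a ~ b, data = dat_demo)

rmse_fit     <- sqrt(mean(residuals(fit)^2))                    # divides by n
sigma_manual <- sqrt(sum(residuals(fit)^2) / df.residual(fit))  # divides by n - 2

# Same number as the predict()-based formula
all.equal(rmse_fit, sqrt(mean((dat_demo$a - predict(fit))^2)))
# Same number as the residual standard error from summary()
all.equal(sigma_manual, summary(fit)$sigma)
```

Either value is in the units of the response, so a smaller number means the points sit closer to the line.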

Answer 2

What you might be looking for is MAPE (mean absolute percentage error). Its advantages over other measures of accuracy (MSE, MPE, RMSE, MAE, etc.) are that it does not depend on the level of the data, it measures absolute errors, and it has a clear meaning. You could use the forecast package to get some of these measures:

library(forecast)
data <- data.frame(y = rnorm(100), x = rnorm(100))
model <- lm(y ~ x, data)
accuracy(model)
#           ME         RMSE          MAE          MPE         MAPE 
# 5.455773e-18 1.019446e+00 7.957585e-01 1.198441e+02 1.205495e+02 
accuracy(model)["MAPE"]
#     MAPE 
# 120.5495

or

mape <- function(f, x) mean(abs(1 - f / x) * 100)
mape(fitted(model), data$y)
# [1] 120.5495

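On the other hand, a signed measure such as MPE (mean percentage error) or ME (mean error) might look better for showing how well data is centered around the prediction line, because errors above and below the line cancel. E.g. let the prediction be p <- rep(2, 20) and the data y <- rep(c(3, 1), 10): the signed errors cancel exactly (ME = 0), while the absolute measures stay large (MAE = 1, MAPE is about 66.7%). With the mpe definition used in this answer, MPE is about -33.3% rather than 0 here, because percentage errors are taken relative to the observed values.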

So you should decide what you really want to show: MAPE is better as a measure of accuracy, but for your second example MPE might be a better choice.
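
The toy example can be run through the definitions used in this answer as a quick check. The mae helper is added here for comparison and is not part of the original answer; f is the prediction and x the observed data, as in mape above.

```r
# f = prediction, x = observed data
me   <- function(f, x) mean(x - f)
mae  <- function(f, x) mean(abs(x - f))          # added for comparison
mpe  <- function(f, x) mean((1 - f / x) * 100)
mape <- function(f, x) mean(abs(1 - f / x) * 100)

p <- rep(2, 20)         # constant prediction
y <- rep(c(3, 1), 10)   # data alternating one unit above and below it

scores <- c(ME = me(p, y), MAE = mae(p, y), MPE = mpe(p, y), MAPE = mape(p, y))
round(scores, 2)
# ME = 0, MAE = 1, MPE = -33.33, MAPE = 66.67
```

The signed error in original units (ME) cancels exactly, while the percentage versions weight the small observation more heavily because they divide by the observed value.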

Update: if it really is centering that you want to check, you should look at measures that sum the errors without squaring them, taking absolute values, etc. That is, you might also want to take a look at ME (mean error), which is a bit simpler than MPE but has a different interpretation. Here is an example somewhat similar to your first one:

[Figure: two panels (id 1 and 2) showing the data with the red prediction line, annotated with MAPE, MPE and ME]

library(ggplot2)  # needed for the plot at the end

mpe <- function(f, x) mean((1 - f / x) * 100)
mape <- function(f, x) mean(abs(1 - f / x) * 100)
me <- function(f, x) mean(x - f)

set.seed(20130130)
y1 <- rnorm(1000, mean = 10, sd = 1.5) * (1:1000) / 300
y2 <- rnorm(1000, mean = 10, sd = 1.7) * (1:1000) / 250
pr <- (1:1000) / 30

data <- data.frame(y = c(y1, y2),
                   x = 1:1000,
                   prediction = rep(pr, 2),
                   id = rep(1:2, each = 1000))

results <- data.frame(MAPE = c(mape(pr, y1), mape(pr, y2)),
                      MPE = c(mpe(pr, y1), mpe(pr, y2)),
                      ME = c(me(pr, y1), me(pr, y2)),
                      id = 1:2)
results <- round(results, 2)

ggplot(data, aes(x, y)) + geom_line() + theme_bw() +
  facet_wrap(~ id) + geom_line(aes(y = prediction), colour = "red") +
  theme(strip.background = element_blank()) + labs(y = NULL, x = NULL) +
  geom_text(data = results, x = 150, y = 50, aes(label = paste("MAPE:", MAPE))) +
  geom_text(data = results, x = 150, y = 45, aes(label = paste("MPE:", MPE))) + 
  geom_text(data = results, x = 150, y = 40, aes(label = paste("ME:", ME)))
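
One caveat when carrying this over to the question's setup: for a line fitted by lm with an intercept, the residuals average to zero by construction (the accuracy() output earlier shows an ME of about 5e-18). ME can therefore only reveal off-centering against a line that comes from somewhere else, like pr above. A minimal sketch on simulated data; x_demo, y_demo and fixed_line are made-up names.

```r
me <- function(f, x) mean(x - f)

set.seed(3)  # arbitrary seed, only for reproducibility
x_demo <- 1:200
y_demo <- 0.5 * x_demo + rnorm(200, sd = 5)

fit <- lm(y_demo ~ x_demo)
me_fitted <- me(fitted(fit), y_demo)  # essentially zero, whatever the data look like

fixed_line <- 0.4 * x_demo            # hypothetical external prediction, slope too small
me_fixed <- me(fixed_line, y_demo)    # clearly positive: the data sit above this line

c(fitted = me_fitted, fixed = me_fixed)
```

If the red line in the question is always the fitted lm line, a spread measure such as RMSE or R-squared is likely the more useful score.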


2 answers

solution1
5 2013-01-29 10:29:01

solution2
1 2013-01-29 13:49:05
