How to get regression coefficients and model fits using correlation or covariance matrix instead of data frame using R?

Question

I want to be able to obtain regression coefficients from a multiple linear regression by supplying a correlation or covariance matrix instead of a data.frame. I realise you lose some information relevant to determining the intercept and so on, but even the correlation matrix should be sufficient for getting standardised coefficients and estimates of variance explained.

So for example, if you had the following data

# get some data
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]

You could run a regression as follows:

lm(EngineSize ~ Horsepower + RPM, x)

but what if instead of having data you had the correlation matrix or the covariance matrix:

corx <- cor(x)
covx <- cov(x)

What function in R allows you to run a regression based on the correlation or covariance matrix? Ideally it should be similar to lm so that you can easily obtain things like r-squared, adjusted r-squared, predicted values and so on. Presumably, for some of these things, you would need to also provide the sample size and possibly a vector of means. But that would also be fine.

I.e., something like:

lm(EngineSize ~ Horsepower + RPM, cov = covx) # obviously this doesn't work

Note that this answer on Stats.SE provides a theoretical explanation for why it's possible, and provides an example of some custom R code for calculating coefficients.

Answer 1

Using lavaan you could do the following:

library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]

lav.input<- cov(x)
lav.mean <- colMeans(x)

library(lavaan)
m1 <- 'EngineSize ~ Horsepower + RPM'
fit <- sem(m1, sample.cov = lav.input, sample.nobs = nrow(x),
           meanstructure = TRUE, sample.mean = lav.mean)
summary(fit, standardized = TRUE)

Results are:

Regressions:
                   Estimate    Std.Err  Z-value  P(>|z|)   Std.lv    Std.all
  EngineSize ~                                                              
    Horsepower          0.015    0.001   19.889    0.000      0.015    0.753
    RPM                -0.001    0.000  -15.197    0.000     -0.001   -0.576

Intercepts:
                  Estimate    Std.Err  Z-value  P(>|z|)   Std.lv    Std.all
   EngineSize          5.805    0.362   16.022    0.000      5.805    5.627

Variances:
                  Estimate    Std.Err  Z-value  P(>|z|)   Std.lv    Std.all
    EngineSize          0.142    0.021    6.819    0.000      0.142    0.133

Answer 2

Remember that:

$\hat{\beta} = (X'X)^{-1} X'Y$
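To see why the covariance matrix alone is enough for the slopes, here is a short derivation. It assumes the variables are mean-centred; the symbols $X_c$, $Y_c$, $S_{xx}$ and $s_{xy}$ (centred predictors, centred outcome, predictor covariance matrix, predictor-outcome covariances) are notation introduced here for illustration and are not part of the original answer.

```latex
\hat{\beta}
  = (X_c' X_c)^{-1} X_c' Y_c
  = \bigl((n-1)\, S_{xx}\bigr)^{-1} (n-1)\, s_{xy}
  = S_{xx}^{-1} s_{xy},
\qquad
\hat{\beta}_0 = \bar{y} - \bar{x}'\hat{\beta}
```

The factor $(n-1)$ cancels, so the slopes need only the covariances, while the intercept additionally needs the means.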

Try:

(bs<-solve(covx[-1,-1],covx[-1,1]))

 Horsepower         RPM 
 0.01491908 -0.00100051

For the intercept you will need the means of the variables. For example:

ms <- colMeans(x)
(b0 <- ms[1] - bs %*% ms[-1])

         [,1]
[1,] 5.805301
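
As a sanity check, the matrix-based estimates can be compared with an ordinary lm() fit on the raw data. The sketch below also derives standard errors for the slopes from the covariance matrix and the sample size; that part is an addition not in the original answer, and the names sigma2, se and ref are illustrative.

```r
library(MASS)  # ships with R; provides the Cars93 data used in the question

x <- Cars93[, c("EngineSize", "Horsepower", "RPM")]
covx <- cov(x)
ms <- colMeans(x)
n <- nrow(x)
p <- ncol(x) - 1  # number of predictors

# Slopes from the covariance matrix (the outcome is in row/column 1)
bs <- solve(covx[-1, -1], covx[-1, 1])

# Intercept from the means
b0 <- ms[1] - sum(bs * ms[-1])

# Residual variance: (n - 1) * (s_yy - b' s_xy) / (n - p - 1)
sigma2 <- (n - 1) * (covx[1, 1] - sum(bs * covx[-1, 1])) / (n - p - 1)

# Slope standard errors: sqrt(diag(sigma^2 * ((n - 1) * S_xx)^-1))
se <- sqrt(diag(sigma2 * solve((n - 1) * covx[-1, -1])))

# Reference fit on the raw data
fit <- lm(EngineSize ~ Horsepower + RPM, data = x)
ref <- summary(fit)$coefficients

# Both comparisons should agree up to floating-point tolerance
all.equal(unname(c(b0, bs)), unname(coef(fit)))
all.equal(unname(se), unname(ref[-1, "Std. Error"]))
```

Note that the standard errors depend on n, so the sample size has to be supplied alongside the covariance matrix.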

Answer 3

I think lavaan is a good option; @Philip pointed me in the right direction. Here I just show how to extract a few extra model features using lavaan (in particular, r-squared and adjusted r-squared) that you might want.

For the latest version see: https://gist.github.com/jeromyanglim/9f766e030966eaa1241f10bd7d6e2812

# get data
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]

# define sample statistics 
covx <- cov(x)
n <- nrow(x)
means <- sapply(x, mean) # this is optional


library(lavaan) # needed for standardizedSolution() and inspect() below
fit <- lavaan::sem("EngineSize ~ Horsepower + RPM", sample.cov = covx,
                   sample.mean = means,
                   sample.nobs = n)

coef(fit) # unstandardised coefficients
standardizedSolution(fit) # Standardised coefficients
inspect(fit, 'r2') # r-squared

# adjusted r-squared
adjr2 <- function(rsquared, n, p) 1 - (1-rsquared)  * ((n-1)/(n-p-1))
# update p below with number of predictor variables
adjr2(inspect(fit, 'r2'), n = inspect(fit, "nobs"), p = 2)
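
As a cross-check that does not depend on lavaan, the standardised coefficients, r-squared and adjusted r-squared can also be computed from the correlation matrix and the sample size in base R. This is a minimal sketch; beta_std, r2 and adj_r2 are illustrative names, and the outcome is assumed to sit in the first row and column of the matrix.

```r
library(MASS)  # ships with R; provides the Cars93 data used in the question

x <- Cars93[, c("EngineSize", "Horsepower", "RPM")]
corx <- cor(x)
n <- nrow(x)
p <- ncol(x) - 1  # number of predictors

# Standardised slopes: solve R_xx %*% beta = r_xy
beta_std <- solve(corx[-1, -1], corx[-1, 1])

# R-squared = r_xy' R_xx^-1 r_xy
r2 <- sum(beta_std * corx[-1, 1])

# Adjusted r-squared, same formula as adjr2() above
adj_r2 <- 1 - (1 - r2) * ((n - 1) / (n - p - 1))

# Compare with lm() on the raw data
ref <- summary(lm(EngineSize ~ Horsepower + RPM, data = x))
all.equal(r2, ref$r.squared)
all.equal(adj_r2, ref$adj.r.squared)
```

Only the correlation matrix is needed for beta_std and r2; the sample size enters solely through the adjustment.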

Custom function

Here is a function that returns the fit from lavaan along with a few relevant features (i.e., basically packaging most of the above). It assumes a case where you don't have the means.

covlm <- function(dv, ivs, n, cov) {
    # Requires the lavaan package (called via lavaan::, so it need not be attached)
    # dv: character vector of length 1 with name of outcome variable
    # ivs: character vector of names of predictors
    # n: numeric vector of length 1: sample size
    # cov: covariance matrix where row and column names
    #       correspond to dv and ivs
    # Return
    #      list with lavaan model fit
    #      and various other features of the model

    results <- list()
    eq <- paste(dv, "~", paste(ivs, collapse = " + "))
    results$fit <- lavaan::sem(eq, sample.cov = cov,
                               sample.nobs = n)

    # coefficients
    ufit <- lavaan::parameterEstimates(results$fit)
    ufit <- ufit[ufit$op == "~", ]
    results$coef <- ufit$est
    names(results$coef) <- ufit$rhs

    sfit <- lavaan::standardizedSolution(results$fit)
    sfit <- sfit[sfit$op == "~", ]
    results$standardizedcoef <- sfit$est.std
    names(results$standardizedcoef) <- sfit$rhs

    # use unclass to not limit r2 to 3 decimals
    results$r.squared <- unclass(lavaan::inspect(results$fit, "r2")) # r-squared

    # adjusted r-squared
    adjr2 <- function(rsquared, n, p) 1 - (1 - rsquared) * ((n - 1) / (n - p - 1))
    results$adj.r.squared <- adjr2(results$r.squared, n = n, p = length(ivs))
    results
}

For example:

x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
covlm(dv = "EngineSize", ivs = c("Horsepower", "RPM"),
      n = nrow(x), cov = cov(x))

This all produces:

$fit
lavaan (0.5-20) converged normally after  27 iterations

  Number of observations                            93

  Estimator                                         ML
  Minimum Function Test Statistic                0.000
  Degrees of freedom                                 0
  Minimum Function Value               0.0000000000000

$coef
 Horsepower         RPM 
 0.01491908 -0.00100051 

$standardizedcoef
Horsepower        RPM 
 0.7532350 -0.5755326 

$r.squared
EngineSize 
     0.867 

$adj.r.squared
EngineSize 
     0.864

Answer 4

Another, somewhat funky, solution is to generate a data set that has the same variance-covariance matrix as the original data. You can do this with mvrnorm() in the MASS package. Using lm() on this new data set will yield parameter estimates and standard errors identical to those that would have been estimated from the original data set, provided n is set to the original sample size (the intercept is the exception: it is inaccessible unless you have the means of each variable). Here's an example of what this would look like:

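# Assuming the variance-covariance matrix is called VC (with row/column names)
library(MASS)
n <- 100 # sample size; use the original n so the standard errors match
nvar <- ncol(VC)
# the argument is Sigma (capital S); lm() needs a data frame, not a matrix
fake.data <- as.data.frame(mvrnorm(n, mu = rep(0, nvar), Sigma = VC, empirical = TRUE))
lm(Y ~ ., data = fake.data) # replace Y with the name of the outcome column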
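
To illustrate the claim, here is a hedged sketch on the Cars93 variables from the question. It goes one step beyond the snippet above by passing the variable means as mu (an assumption that the means are available), which lets the intercept be recovered as well; fit_fake and fit_real are illustrative names.

```r
library(MASS)  # provides mvrnorm() and the Cars93 data

x <- Cars93[, c("EngineSize", "Horsepower", "RPM")]
VC <- cov(x)
n <- nrow(x)  # must equal the original sample size for matching standard errors

set.seed(1)  # the simulated values vary by seed; the sample moments do not
fake.data <- as.data.frame(
  mvrnorm(n, mu = colMeans(x), Sigma = VC, empirical = TRUE)
)

fit_fake <- lm(EngineSize ~ Horsepower + RPM, data = fake.data)
fit_real <- lm(EngineSize ~ Horsepower + RPM, data = x)

# Coefficients and standard errors should agree up to floating-point tolerance
all.equal(coef(fit_fake), coef(fit_real))
all.equal(summary(fit_fake)$coefficients[, "Std. Error"],
          summary(fit_real)$coefficients[, "Std. Error"])
```

With empirical = TRUE, mu and Sigma are reproduced exactly in the sample rather than only in expectation, which is what makes the two fits coincide. Residuals and predicted values for individual cases are not recoverable this way, since the simulated rows are not the original observations.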

solution1
2 ACCEPTED 2016-07-25 01:18:25

solution2
1 2016-07-25 00:59:41

solution3
1 2016-07-25 01:37:10

solution4
0 2017-08-25 21:39:32
