
When simulating multivariate data for regression, how can I set the R-squared (example code included)?

I am trying to simulate a three-variable dataset so that I can run linear regression models on it. 'X1' and 'X2' would be continuous independent variables (mean=0, sd=1), and 'Y' would be the continuous dependent variable.

The regression model should produce coefficients like this: Y = 5 + 3(X1) - 2(X2)

I would like to simulate this dataset such that the resulting regression model has an R-squared value of 0.2. How can I determine the value of 'sd.value' so that the regression model has this R-squared?

n <- 200 
set.seed(101) 
sd.value <- 1

X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)

simdata <- data.frame(X1, X2, Y)

summary(lm(Y ~ X1 + X2, data=simdata))

Take a look at this code; it should be close enough to what you want:

simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
    stopifnot(length(beta) == 3)
    df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs))  # x1 and x2 are independent
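    # solve R.sq = signal.var / (signal.var + var.epsilon) for var.epsilon,
    # where signal.var = beta[2]^2 + beta[3]^2 since var(x1) = var(x2) = 1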
    var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
    stopifnot(var.epsilon > 0)
    df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
    df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
    return(df)
}
get.R.sq <- function(desired) {
    model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
    return(summary(model)$r.squared)
}
df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
plot(df)
abline(a=0, b=1, col="red", lty=2)

Basically your question comes down to figuring out the expression for var.epsilon. Since we have y = b1 + b2*x1 + b3*x2 + epsilon, and the Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where var[x1] = var[x2] = 1 by assumption. Since R.sq = (b2^2 + b3^2) / var[y], solving for var[eps] gives var[eps] = (b2^2 + b3^2) * (1 - R.sq) / R.sq.
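As a quick numeric sanity check (my own addition, just plugging numbers into the formula above): with beta = c(5, 3, -2) and a target R-squared of 0.2, the error variance should be (3^2 + (-2)^2) * (1 - 0.2) / 0.2 = 52:

beta <- c(5, 3, -2)
R.sq <- 0.2
var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
var.epsilon        # 52
sqrt(var.epsilon)  # 7.2111, the sd to pass to rnorm() as sd.value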

So the formula for R^2 is 1 - var(residual)/var(total).

In this case, the variance of Y is going to be 3^2 + 2^2 + sd.value^2, since we are adding three independent random variables. And, asymptotically, the residual variance is going to be simply sd.value^2.

So you can compute rsquared explicitly with this function:

rsq <- function(x) { 1 - x^2 / (9 + 4 + x^2) }

With a little algebra (rewrite r = 1 - x^2/(13 + x^2) as r = 13/(13 + x^2) and solve for x), you can compute the inverse of this function:

rsqi <- function(x) { sqrt(13) * sqrt((1 - x) / x) }

So setting sd.value <- rsqi(rsquared) should give you what you want.
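For the asker's target of 0.2, this gives (a quick check of my own, using rsqi() from above):

rsqi(0.2)  # 2 * sqrt(13) = 7.2111, i.e. sqrt(52)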

We can test this as follows:

simrsq <- function(x) {
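  # relies on n, X1, and X2 from the question's code above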
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), rsqi(x))
  simdata <- data.frame(X1, X2, Y)
  summary(lm(Y ~ X1 + X2, data=simdata))$r.squared
}

> meanrsq<-rep(0,9)
> for(i in 1:50)
+   meanrsq<-meanrsq+Vectorize(simrsq)((1:9)/10)
> meanrsq/50
[1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
[8] 0.7999349 0.8977187

So it looks to be correct.

This is how I would do it (a blind iterative algorithm, assuming no prior knowledge, for when you are purely interested in "how to simulate this"):

simulate.sd <- function(nsim=10, n=200, seed=101, tol=0.01) {
  set.seed(seed)
  sd.value <- 1
  rsquare <- 1:nsim
  results <- 1:nsim
  for (i in 1:nsim) {
    # track iterations: if we never hit the target r-squared, abort after 3000 steps
    iter <- 0 
    while (rsquare[i] > (0.2 + tol) || rsquare[i] < (0.2 - tol)) {
      sd.value <- sd.value + 0.01
      rsquare[i] <- simulate.sd.iter(sd.value, n)
      iter <- iter + 1
      if (iter > 3000) { break }
    }
    results[i] <- sd.value  # store the current sd.value that is OK!
    sd.value <- 1
  }
  cbind(results, rsquare)
}

simulate.sd.iter <- function(sd.value, n=200) {  # helper function
  # Takes the sd.value, creates data, and returns the r-squared
  X1 <- rnorm(n, 0, 1)
  X2 <- rnorm(n, 0, 1)
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
  simdata <- data.frame(X1, X2, Y)
  return(summary(lm(Y ~ X1 + X2, data=simdata))$r.squared)
}

simulate.sd()

A few things to note:

  • I let X1 and X2 vary between runs, since this affects the sought sd.value.
  • The tolerance determines how exact you want the estimate to be. Are you fine with an r-squared of ~0.19 or ~0.21? Then a tolerance of 0.01 will do.
  • Note that too tight a tolerance might not allow you to find a result.
  • A starting value of 1 is quite bad, making this iterative algorithm quite slow.

The resulting vector for 10 runs is:

[1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55

This takes roughly 13 seconds on my machine.

My next step would be to start from 4.5, add 0.001 per iteration instead of 0.01, and perhaps lower the tolerance. Good luck!
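A sketch of that refinement (my own illustration, not the answerer's code; simulate.sd2, sd.start, step, and target are names I am introducing, and it reuses simulate.sd.iter() from above):

simulate.sd2 <- function(nsim=10, n=200, seed=101, tol=0.01,
                         sd.start=4.5, step=0.001, target=0.2) {
  set.seed(seed)
  results <- numeric(nsim)
  rsquare <- numeric(nsim)
  for (i in 1:nsim) {
    sd.value <- sd.start
    r2 <- simulate.sd.iter(sd.value, n)
    iter <- 0
    # step sd.value upward until the simulated r-squared is within tolerance
    while (abs(r2 - target) > tol && iter <= 3000) {
      sd.value <- sd.value + step
      r2 <- simulate.sd.iter(sd.value, n)
      iter <- iter + 1
    }
    results[i] <- sd.value
    rsquare[i] <- r2
  }
  cbind(results, rsquare)
}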

Alright, some summary statistics for nsim=100, taking 150 seconds, with a step of 0.001 and the tolerance still at 0.01:

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 4.513   4.913   5.036   5.018   5.157   5.393 

Why are you interested in this though?

Here is some more code that generates a multiple linear regression with normally distributed errors. (Oops, sorry: this code just produces a plain multiple regression.)

sim.regression <- function(n.obs=10, coefficients=runif(10, -5, 5), s.deviation=0.1) {
  n.var <- length(coefficients)
  M <- matrix(0, ncol=n.var, nrow=n.obs)
  beta <- as.matrix(coefficients)
  for (i in 1:n.var) {
    M[,i] <- rnorm(n.obs, 0, 1)
  }
  y <- M %*% beta + rnorm(n.obs, 0, s.deviation)
  return(list(x=M, y=y, coeff=coefficients))
}
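A quick usage example (my own addition, not from the original answer): simulate 500 observations with the question's coefficients and check that lm() roughly recovers them. Note that sim.regression() generates no intercept term, so the fitted intercept should be near 0.

set.seed(1)
sim <- sim.regression(n.obs=500, coefficients=c(5, 3, -2), s.deviation=1)
fit <- lm(as.vector(sim$y) ~ sim$x)
coef(fit)  # intercept near 0; slopes near 5, 3, and -2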
