When simulating multivariate data for regression, how can I set the R-squared (example code included)?

I am trying to simulate a three-variable dataset so that I can run linear regression models on it. 'X1' and 'X2' would be continuous independent variables (mean=0, sd=1), and 'Y' would be the continuous dependent variable.

The regression model should produce coefficients like this: Y = 5 + 3(X1) - 2(X2)

I would like to simulate this dataset such that the resulting regression model has an R-squared value of 0.2. How can I determine the value of 'sd.value' so that the regression model has this R-squared?

n <- 200 
sd.value <- 1

X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)

simdata <- data.frame(X1, X2, Y)

summary(lm(Y ~ X1 + X2, data=simdata))

Take a look at this code, it should be close enough to what you want:

simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
    stopifnot(length(beta) == 3)
    df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs))  # x1 and x2 are independent
    var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
    stopifnot(var.epsilon > 0)
    df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
    df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
    return(df)
}
get.R.sq <- function(desired) {
    # fit the model on simulated data and return the R-squared actually obtained
    model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
    return(summary(model)$r.squared)
}
df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
plot(df)  # desired vs. actual R-squared
abline(a=0, b=1, col="red", lty=2)  # reference line: actual == desired

Basically, your question comes down to figuring out the expression for var.epsilon. Since we have y = b1 + b2*x1 + b3*x2 + epsilon, and the Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where var[x1] = var[x2] = 1 by assumption. You can then solve for var[eps] as a function of R-squared.
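Written out, the algebra behind var.epsilon looks like this. It is a sketch under the same assumptions as the code above (x1, x2, and epsilon independent, with var[x1] = var[x2] = 1); the symbol \sigma_\varepsilon^2 is introduced here to stand for var[eps].

```latex
\begin{aligned}
\operatorname{Var}[y] &= b_2^2 + b_3^2 + \sigma_\varepsilon^2, \\
R^2 &= 1 - \frac{\sigma_\varepsilon^2}{\operatorname{Var}[y]}
     = \frac{b_2^2 + b_3^2}{b_2^2 + b_3^2 + \sigma_\varepsilon^2}, \\
\sigma_\varepsilon^2 &= \left(b_2^2 + b_3^2\right)\frac{1 - R^2}{R^2}.
\end{aligned}
```

The last line is the expression assigned to var.epsilon in simulate(). It describes the population R-squared, so a model fitted to a finite sample will only match it approximately.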

So the formula for R^2 is 1 - var(residual)/var(total).

In this case, the variance of Y is going to be 3^2 + 2^2 + sd.value^2, since we are adding three independent random variables. And, asymptotically, the residual variance is simply going to be sd.value^2.

So you can compute rsquared explicitly with this function:

rsq<-function(x){1-x^2/(9+ 4+x^2)}

With a little algebra, you can compute the inverse of this function:

rsqi<-function(x){sqrt(13*(1-x)/x)}

So setting sd.value<-rsqi(rsquared) should give you what you want.

We can test this as follows:
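As a quick illustration for the target in the question (R-squared = 0.2), here is a minimal sketch. It redefines rsqi so the snippet runs on its own; the seed and the large n are assumptions chosen only to make the sample R-squared land close to its limiting value.

```r
# Inverse of rsq(): the noise sd that gives a target R-squared
# for Y = 5 + 3*X1 - 2*X2 with independent standard-normal X1, X2.
rsqi <- function(x) sqrt(13 * (1 - x) / x)

set.seed(1)            # arbitrary seed, only for reproducibility
n <- 1e5               # large n so the sample R-squared is near its limit
sd.value <- rsqi(0.2)  # equals sqrt(52)

X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)

r2 <- summary(lm(Y ~ X1 + X2))$r.squared
print(c(sd.value = sd.value, r.squared = r2))
```

With n = 200, as in the question, the fitted R-squared will scatter more widely around 0.2 from one simulated dataset to the next.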

simrsq<-function(x){
  # uses n, X1 and X2 from the question's code
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), rsqi(x))
  simdata <- data.frame(X1, X2, Y)
  summary(lm(Y ~ X1 + X2, data=simdata))$r.squared
}

> meanrsq<-rep(0,9)
> for(i in 1:50)
+   meanrsq<-meanrsq+Vectorize(simrsq)((1:9)/10)
> meanrsq/50
[1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
[8] 0.7999349 0.8977187

So it looks to be correct.

This is how I would do it (a blind iterative algorithm that assumes no prior knowledge, for when you are purely interested in "how to simulate this"):

simulate.sd <- function(nsim=10, n=200, seed=101, tol=0.01) {
  # Note: the seed argument is accepted but not used in this function.
  sd.value <- 1
  rsquare <- 1:nsim  # initial values >= 1 are outside the tolerance band, so each loop starts
  results <- 1:nsim
  for (i in 1:nsim) {
    # tracking iteration: if we keep missing the value, give up after 3000 steps.
    iter <- 0
    while (rsquare[i] > (0.20 + tol) | rsquare[i] < (0.2 - tol)) {
      sd.value <- sd.value + 0.01
      rsquare[i] <- simulate.sd.iter(sd.value, n)
      iter <- iter + 1
      if (iter > 3000) { break }
    }
    results[i] <- sd.value  # store the current sd.value that is OK!
    sd.value <- 1
  }
  cbind(results, rsquare)
}

simulate.sd.iter <- function(sd.value, n=200) {  # helper function
  # Takes the sd.value, creates data, and returns the r-squared
  X1 <- rnorm(n, 0, 1)
  X2 <- rnorm(n, 0, 1)
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
  simdata <- data.frame(X1, X2, Y)
  return(summary(lm(Y ~ X1 + X2, data=simdata))$r.squared)
}


A few things to note:

  • I let X1 and X2 vary, since this affects the sd.value being sought.
  • The tolerance is how exact you want this estimate to be. Are you fine with an r-squared of ~0.19 or ~0.21? Then set the tolerance to 0.01.
  • Note that too tight a tolerance might not allow you to find a result.
  • The value of 1 is quite a bad starting value, making this iterative algorithm quite slow.

The resulting vector for 10 results is:

[1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55

which takes roughly 13 seconds on my machine.

My next step would be to start from 4.5, add 0.001 per iteration instead of 0.01, and perhaps lower the tolerance. Good luck!

Alright, some summary statistics for nsim=100, taking 150 seconds, with a step size of 0.001 and the tolerance still at 0.01:

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 4.513   4.913   5.036   5.018   5.157   5.393 

Why are you interested in this though?

Here is another piece of code that generates data for a multiple linear regression whose errors follow a normal distribution. Oops, sorry: this code just produces the multiple regression data.

sim.regression <- function(n.obs=10, coefficients=runif(10, -5, 5), s.deviation=.1) {
  n.var <- length(coefficients)
  M <- matrix(0, ncol=n.var, nrow=n.obs)
  beta <- as.matrix(coefficients)
  for (i in 1:n.var) {
    M[, i] <- rnorm(n.obs, 0, 1)
  }
  y <- M %*% beta + rnorm(n.obs, 0, s.deviation)
  return(list(x=M, y=y, coeff=coefficients))
}
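To tie this back to the R-squared question, the noise standard deviation for a setup like sim.regression could be chosen as in the sketch below. It assumes independent predictors with variance 1, as in sim.regression; the helper name noise.sd.for.rsq, the example coefficients, and the seed are hypothetical and not from the thread.

```r
# Hypothetical helper: noise sd that gives a target population R-squared,
# assuming independent predictors with variance 1 (an assumption of this sketch).
noise.sd.for.rsq <- function(coefficients, R.sq) {
  stopifnot(R.sq > 0, R.sq < 1)
  sqrt(sum(coefficients^2) * (1 - R.sq) / R.sq)
}

set.seed(42)                   # arbitrary seed, only for reproducibility
n.obs <- 1e5                   # large sample so R-squared is near its limit
coefficients <- c(3, -2, 1.5)  # illustrative values
s.deviation <- noise.sd.for.rsq(coefficients, R.sq = 0.2)

# Same data-generating steps as sim.regression
M <- matrix(rnorm(n.obs * length(coefficients)), nrow = n.obs)
y <- as.vector(M %*% coefficients) + rnorm(n.obs, 0, s.deviation)

r2 <- summary(lm(y ~ M))$r.squared
print(r2)
```

The resulting s.deviation can be passed as the s.deviation argument of sim.regression. If the predictors were correlated or had other variances, the sum of squared coefficients would need to be replaced by the variance of the linear predictor.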
