When simulating multivariate data for regression, how can I set the R-squared (example code included)?
I am trying to simulate a three-variable dataset so that I can run linear regression models on it. 'X1' and 'X2' would be continuous independent variables (mean=0, sd=1), and 'Y' would be the continuous dependent variable.
The regression model should produce coefficients like this: Y = 5 + 3(X1) - 2(X2)
I would like to simulate this dataset such that the resulting regression model has an R-squared value of 0.2. How can I determine the value of 'sd.value' so that the regression model has this R-squared?
n <- 200
set.seed(101)
sd.value <- 1
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
simdata <- data.frame(X1, X2, Y)
summary(lm(Y ~ X1 + X2, data=simdata))
Take a look at this code; it should be close enough to what you want:
simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
  stopifnot(length(beta) == 3)
  df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs)) # x1 and x2 are independent
  var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
  stopifnot(var.epsilon > 0)
  df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
  df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
  return(df)
}
get.R.sq <- function(desired) {
  model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
  return(summary(model)$r.squared)
}
df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
plot(df)
abline(a=0, b=1, col="red", lty=2)
Basically, your question comes down to finding the expression for var.epsilon. Since y = b1 + b2*x1 + b3*x2 + epsilon, and the Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where var[x1] = var[x2] = 1 by assumption. You can then solve for var[eps] as a function of R-squared.
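Spelling out that last step, under the same assumptions (x1, x2 and epsilon independent, var[x1] = var[x2] = 1) and writing R^2 for the population value rather than the value from any one sample:

```latex
\begin{align*}
\operatorname{var}[y] &= b_2^2 + b_3^2 + \operatorname{var}[\varepsilon], \\
R^2 &= 1 - \frac{\operatorname{var}[\varepsilon]}{\operatorname{var}[y]}
     = \frac{b_2^2 + b_3^2}{b_2^2 + b_3^2 + \operatorname{var}[\varepsilon]}, \\
\operatorname{var}[\varepsilon] &= \left(b_2^2 + b_3^2\right)\frac{1 - R^2}{R^2}.
\end{align*}
```

The last line is the expression assigned to var.epsilon inside simulate().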
So the formula for R^2 is 1-var(residual)/var(total)
In this case, the variance of Y is going to be 3^2+2^2+sd.value^2, since we are adding three independent random variables. And, asymptotically, the residual variance is going to be simply sd.value^2.
So you can compute the R-squared explicitly with this function:
rsq <- function(x) { 1 - x^2/(9 + 4 + x^2) }
With a little algebra, you can compute the inverse of this function:
rsqi <- function(x) { sqrt(13)*sqrt((1-x)/x) }
So setting sd.value <- rsqi(rsquared) should give you what you want.
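For the R-squared of 0.2 asked about in the question, here is a quick worked check. It is only a sketch: the seed, the sample size n.big and the name r2.big are my own choices for illustration, and rsqi is repeated so the snippet runs on its own.

```r
# Inverse of the population R-squared formula for Y = 5 + 3*X1 - 2*X2 + noise,
# with X1 and X2 independent standard normals (repeated from the answer).
rsqi <- function(x) { sqrt(13) * sqrt((1 - x) / x) }

sd.value <- rsqi(0.2)   # sqrt(13 * 0.8 / 0.2) = sqrt(52), about 7.21

# Illustrative check on one large sample (seed and n.big are arbitrary choices).
set.seed(101)
n.big <- 1e5
X1 <- rnorm(n.big, 0, 1)
X2 <- rnorm(n.big, 0, 1)
Y  <- rnorm(n.big, 5 + 3*X1 - 2*X2, sd.value)
r2.big <- summary(lm(Y ~ X1 + X2))$r.squared   # should land close to 0.2
```

With n = 200, as in the question, individual fits will scatter around 0.2 rather than hit it exactly.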
We can test this as follows:
# uses n, X1 and X2 as defined in the question's code
simrsq <- function(x) {
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), rsqi(x))
  simdata <- data.frame(X1, X2, Y)
  summary(lm(Y ~ X1 + X2, data=simdata))$r.squared
}
> meanrsq<-rep(0,9)
> for(i in 1:50)
+ meanrsq<-meanrsq+Vectorize(simrsq)((1:9)/10)
> meanrsq/50
[1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
[8] 0.7999349 0.8977187
So it looks to be correct.
This is how I would do it (a blind iterative algorithm, assuming no prior knowledge, for when you are purely interested in "how to simulate this"):
simulate.sd <- function(nsim=10, n=200, seed=101, tol=0.01) {
  set.seed(seed)
  sd.value <- 1
  rsquare <- 1:nsim # placeholders, all outside the tolerance band; overwritten below
  results <- 1:nsim
  for (i in 1:nsim) {
    # tracking iteration: if we keep missing the value, abort after 3000 steps.
    iter <- 0
    while (rsquare[i] > (0.20 + tol) | rsquare[i] < (0.2 - tol)) {
      sd.value <- sd.value + 0.01
      rsquare[i] <- simulate.sd.iter(sd.value, n)
      iter <- iter + 1
      if (iter > 3000) { break }
    }
    results[i] <- sd.value # store the current sd.value that is OK!
    sd.value <- 1
  }
  cbind(results, rsquare)
}
simulate.sd.iter <- function(sd.value, n=200) { # helper function
  # Takes the sd.value, creates data, and returns the r-squared
  X1 <- rnorm(n, 0, 1)
  X2 <- rnorm(n, 0, 1)
  Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
  simdata <- data.frame(X1, X2, Y)
  return(summary(lm(Y ~ X1 + X2, data=simdata))$r.squared)
}
simulate.sd()
A few things to note:
I let X1 and X2 vary, since this affects the sought sd.value.
The resulting vector for 10 results is:
[1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55
which takes roughly 13 seconds on my machine.
My next step would be to start from 4.5, add 0.001 per iteration instead of 0.01, and perhaps lower the tolerance. Good luck!
Alright, some summary statistics for nsim=100, taking 150 seconds, with a step size of 0.001 and the tolerance still at 0.01:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.513   4.913   5.036   5.018   5.157   5.393
Why are you interested in this though?
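One thing worth checking with a first-hit search like this: every step draws a fresh sample, and at n = 200 the sample R-squared is noisy, so the loop can stop at an sd.value whose average R-squared is not 0.2. Below is a minimal sketch of such a check. The helper name mean.rsq, the seed and the replicate count are assumptions made for illustration. It averages the R-squared over repeated draws at a search-style value of 5 and at sqrt(52), the value implied by var(Y) = 3^2 + 2^2 + sd.value^2.

```r
# Average sample R-squared over repeated simulated datasets at a fixed sd.value.
# mean.rsq is a hypothetical helper, not part of the answers in this thread.
mean.rsq <- function(sd.value, n=200, reps=200) {
  one.fit <- function() {
    X1 <- rnorm(n, 0, 1)
    X2 <- rnorm(n, 0, 1)
    Y <- rnorm(n, 5 + 3*X1 - 2*X2, sd.value)
    summary(lm(Y ~ X1 + X2))$r.squared
  }
  mean(replicate(reps, one.fit()))
}

set.seed(101)                         # arbitrary seed, for reproducibility only
rsq.at.5 <- mean.rsq(5)               # population value: 13 / (13 + 25), about 0.34
rsq.at.sqrt52 <- mean.rsq(sqrt(52))   # population value: 13 / (13 + 52) = 0.2
print(c(rsq.at.5, rsq.at.sqrt52))
```

If the two averages differ as the population values suggest, the values near 5 reflect where the search happened to stop, not the sd.value that gives 0.2 on average.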
Here is another piece of code that generates data for a multiple linear regression with normally distributed errors. Oops, sorry, this code just produces a multiple regression:
sim.regression <- function(n.obs=10, coefficients=runif(10, -5, 5), s.deviation=.1) {
  n.var <- length(coefficients)
  M <- matrix(0, ncol=n.var, nrow=n.obs)
  beta <- as.matrix(coefficients)
  for (i in 1:n.var) {
    M[, i] <- rnorm(n.obs, 0, 1)
  }
  y <- M %*% beta + rnorm(n.obs, 0, s.deviation)
  return(list(x=M, y=y, coeff=coefficients))
}