
Generate a binary variable with a predefined correlation to an already existing variable

For a simulation study, I want to generate a set of random variables (both continuous and binary) that have predefined associations with an already existing binary variable, denoted here as x.

For this post, assume that x is generated following the code below. But remember: in real life, x is an already existing variable.

set.seed(1245)
x <- rbinom(1000, 1, 0.6)

I want to generate both a binary variable and a continuous variable. I have figured out how to generate a continuous variable (see code below):

set.seed(1245)

cor <- 0.8 #Correlation 
y <- rnorm(1000, cor*x, sqrt(1-cor^2))

But I can't find a way to generate a binary variable that is correlated with the already existing variable x. I found several R packages, such as copula, which can generate random variables with a given dependency structure. However, they do not offer a way to generate variables with a set dependency on an already existing variable.

Does anyone know how to do this in an efficient way?

Thanks!

If we look at the formula for correlation:

rho = (E(XY) - E(X)*E(Y)) / sqrt( (E(X^2) - E(X)^2) * (E(Y^2) - E(Y)^2) )

For the new vector y, the problem is easier to solve if we preserve the mean. That means we copy the vector x and flip an equal number of 1s and 0s to achieve the intended correlation value.

Let E(X) = E(Y) = x_bar and E(XY) = xy_bar. Because X and Y are binary, E(X^2) = E(X) and E(Y^2) = E(Y), so both variances equal x_bar - x_bar^2, and for a given rho the correlation formula simplifies to:

(xy_bar - x_bar^2) / (x_bar - x_bar^2) =  rho

Solve and we get:

xy_bar = rho * x_bar + (1-rho)*x_bar^2

And we can derive a function to flip a number of 1s and 0s to get the result:

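create_vector = function(x,rho){

  n = length(x)
  x_bar = mean(x)
  # target E(XY) for the requested correlation, keeping mean(y) == mean(x)
  xy_bar = rho * x_bar + (1-rho)*x_bar^2
  # number of 1s to turn into 0s (and of 0s to turn into 1s)
  toflip = sum(x == 1) - round(n * xy_bar)

  zeros = which(x==0)
  ones = which(x==1)
  y = x
  # index via sample.int() so a single-element index vector is not
  # misread by sample() as the range 1:k
  y[zeros[sample.int(length(zeros),toflip)]] = 1
  y[ones[sample.int(length(ones),toflip)]] = 0
  return(y)
}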

For your example it works:

set.seed(1245)
x <- rbinom(1000, 1, 0.6)
cor(x,create_vector(x,0.8))
[1] 0.7986037
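
As a quick sanity check of the mean-preserving claim, here is a minimal sketch. It repeats the flipping function so that it runs on its own; the variable names zeros and ones are illustrative choices, not part of the original answer.

```r
# Same flipping idea as create_vector() above, repeated so the snippet is self-contained
create_vector <- function(x, rho) {
  n <- length(x)
  x_bar <- mean(x)
  xy_bar <- rho * x_bar + (1 - rho) * x_bar^2
  toflip <- sum(x == 1) - round(n * xy_bar)
  zeros <- which(x == 0)
  ones <- which(x == 1)
  y <- x
  y[zeros[sample.int(length(zeros), toflip)]] <- 1
  y[ones[sample.int(length(ones), toflip)]] <- 0
  y
}

set.seed(1245)
x <- rbinom(1000, 1, 0.6)
y <- create_vector(x, 0.8)

sum(x) == sum(y)   # TRUE: as many 0s were turned into 1s as 1s into 0s
sum(x != y)        # number of entries that differ (twice toflip)
table(x, y)        # 2x2 cross-tabulation of the old and new vector
```

Because the correlation of two binary vectors depends only on the counts in this 2x2 table, the result is the same no matter which particular entries get flipped.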

There are some extreme combinations of intended rho and p where you might run into problems, for example:

set.seed(111)

res = lapply(1:1000,function(i){
             
              this_rho = runif(1)
              this_p = runif(1)
              x = rbinom(1000,1,this_p)
              data.frame(
                intended_rho = this_rho,
                p = this_p,
                resulting_cor = cor(x,create_vector(x,this_rho))
              )
           })

res = do.call(rbind,res)

library(ggplot2)
ggplot(res,aes(x=intended_rho,y=resulting_cor,col=p)) + geom_point()

[Figure: scatter plot of resulting_cor against intended_rho, with points colored by p]

Here's a binomial one - the formula for q only depends on the mean of x and the correlation you desire.

set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- 1/((1-p)/cor^2+p)
y <- rbinom(100000, 1, q)
z <- x*y
cor(x,z)
#> [1] 0.7984781

This is not the only way to do this - note that mean(z) is always less than mean(x) in this construction.
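
For reference, here is a sketch of where the expression for q appears to come from. It writes r for the target correlation (cor in the code), treats p as the mean of x, and assumes y is drawn independently of x with success probability q, so that z = xy:

```latex
\begin{aligned}
E[z] &= E[xy] = pq, \qquad E[xz] = E[x^2 y] = pq, \\
\operatorname{Cov}(x, z) &= pq - p \cdot pq = pq(1-p), \\
\operatorname{Cor}(x, z) &= \frac{pq(1-p)}{\sqrt{p(1-p)}\,\sqrt{pq(1-pq)}}
  = \sqrt{\frac{q(1-p)}{1-pq}} = r \\
\Longrightarrow \quad q &= \frac{r^2}{1 - p + r^2 p} = \frac{1}{(1-p)/r^2 + p}.
\end{aligned}
```

Since E[z] = pq with q at most 1, this also shows why mean(z) cannot exceed mean(x) under this construction.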

The continuous variable is even less well defined - do you really not care about its mean/variance, or anything else about its distribution?
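
If only the correlation matters for the continuous variable, one possible sketch is to standardize x first and then mix it with independent normal noise. This is an assumption about what is wanted, not something the question specifies: the resulting y has mean near 0 and variance near 1, and it is a mixture of two normals rather than a normal variable. The names rho and x_std are illustrative.

```r
set.seed(1245)
rho <- 0.8
x <- rbinom(100000, 1, 0.6)

# Standardize x so that it has mean 0 and variance 1
x_std <- (x - mean(x)) / sd(x)

# Mix the standardized x with independent standard normal noise
y <- rho * x_std + sqrt(1 - rho^2) * rnorm(length(x))

cor(x, y)   # close to rho in large samples
```

Without the standardization step, the correlation with x depends on the variance of x and generally falls short of the value plugged in.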

Here's another simple version where it flips the variable both ways:

set.seed(1245)
cor <- 0.8
x <- rbinom(100000, 1, 0.6)
p <- mean(x)
q <- (1+cor/sqrt(1-(2*p-1)^2*(1-cor^2)))/2
y <- rbinom(100000, 1, q)
z <- x*y+(1-x)*(1-y)
cor(x,z)
#> [1] 0.8001219
mean(z)
#> [1] 0.57908
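
The two-way flip can be wrapped in a small helper. This is a minimal sketch: the function name flip_both_ways and the input checks are hypothetical additions, and the formula for q is the one used above. The same algebra appears to carry over to negative targets, so the check allows rho between -1 and 1.

```r
# Hypothetical helper: keep each entry of x with probability q, flip it otherwise
flip_both_ways <- function(x, rho) {
  stopifnot(all(x %in% c(0, 1)), abs(rho) <= 1)
  p <- mean(x)
  # probability of keeping an entry unchanged, same expression as q above
  q <- (1 + rho / sqrt(1 - (2 * p - 1)^2 * (1 - rho^2))) / 2
  keep <- rbinom(length(x), 1, q)
  ifelse(keep == 1, x, 1 - x)
}

set.seed(1245)
x <- rbinom(100000, 1, 0.6)
z_pos <- flip_both_ways(x, 0.8)
z_neg <- flip_both_ways(x, -0.5)
cor(x, z_pos)   # close to 0.8
cor(x, z_neg)   # close to -0.5
```

Unlike the count-based flipping function, this version flips entries at random, so the achieved correlation matches the target only approximately and improves with sample size.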
