
Simulate high-dimensional multivariate normal data in R

I am trying to simulate high-dimensional multivariate normal data in R with n = 100 and p = 400 (two different groups of variables, with some of the variables correlated). Here is my code:

## load library MASS
library(MASS)

## sample size set to n = 100
sample_size <- 100      

## simulate two different groups of variables, 200 variables each
sample_meanvector <- c(runif(200,0,1), runif(200,6,8))

## covariance matrix, some variables set to be correlated
sample_covariance_matrix <- matrix(NA, nrow = 400, ncol = 400)
diag(sample_covariance_matrix) <- 1
set.seed(666)
sample_covariance_matrix[lower.tri(sample_covariance_matrix)] <- runif(79800, 0.00001, 0.2)
sample_covariance_matrix[lower.tri(sample_covariance_matrix)][sample(1:79800, 10000)] <- runif(10000, 0.6, 0.9)

## make the matrix symmetric
sample_covariance_matrix[upper.tri(sample_covariance_matrix)]<-t(sample_covariance_matrix)[upper.tri(sample_covariance_matrix)]

## create multivariate normal distribution
sample_distribution <- mvrnorm(n = sample_size,
                               mu = sample_meanvector,
                               Sigma = sample_covariance_matrix)

However, every time I run mvrnorm, I get this error:

Error in mvrnorm(n = sample_size, mu = sample_meanvector, Sigma = sample_covariance_matrix): 'Sigma' is not positive definite

I have two questions:

  1. Why do I get this error?
  2. How can I edit my code to simulate high-dimensional multivariate normal data following the idea described above?

Thanks very much!
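On the first question, here is a minimal sketch of how one might check what mvrnorm is complaining about. A covariance matrix must have no negative eigenvalues, and filling the off-diagonal entries independently at random does not guarantee that. The 3 x 3 matrix S below is a hypothetical example, not taken from the code above: it is symmetric with ones on the diagonal, yet its three pairwise correlations cannot all hold at once.

```r
# Hypothetical example: variables 1-2 and 1-3 are strongly positively
# correlated, but 2-3 are strongly negatively correlated.
S <- matrix(c( 1.0,  0.9,  0.9,
               0.9,  1.0, -0.9,
               0.9, -0.9,  1.0), nrow = 3, byrow = TRUE)

# Eigenvalues of the symmetric matrix; a valid covariance matrix has none below zero
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
min_ev <- min(ev)

# TRUE only if every eigenvalue is strictly positive
is_pd <- all(ev > 0)
print(min_ev)
print(is_pd)
```

The same eigen() check can be run on sample_covariance_matrix to see whether any of its eigenvalues are negative.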

Here is an approach that can generate a highly correlated matrix:

library(MASS)
library(Matrix)

sample_size <- 100      
sample_meanvector <- c(runif(200,0,1), runif(200,6,8))
sample_covariance_matrix <- matrix(NA, nrow = 400, ncol = 400)
diag(sample_covariance_matrix) <- 1
set.seed(666)

mat_Sim <- matrix(data = NA, nrow = 400, ncol = 400)
U <- runif(n = 400) * 0.5

for(i in 1 : 400)
{
  if(i <= 350)
  {
    U_Star <- pmin(U + 0.25 * runif(n = 400), 0.99999)
    
  }else
  {
    U_Star <- pmin(pmax(U + sample(c(-1, 1), size = 400, replace = TRUE) * runif(n = 400), 0.00001), 0.99999)
  }
  
  mat_Sim[, i] <- qnorm(U_Star)  
}

cor_Mat <- cor(mat_Sim)
sample_covariance_matrix <- cor_Mat * 200

## create multivariate normal distribution
sample_distribution <- mvrnorm(n = sample_size,
                               mu = sample_meanvector,
                               Sigma = sample_covariance_matrix)

image(cor_Mat)

This produces a correlation matrix with 7/8 of the matrix correlated between 0.6 and 0.8 and 1/8 of the matrix correlated between 0 and 0.3.

[Figure: heat map of the correlation matrix]
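The reason this route avoids the error appears to be that cor() applied to a data matrix always returns a positive semidefinite matrix, whatever the data look like. A small sketch of that property, with a hypothetical 50 x 5 data matrix X standing in for mat_Sim:

```r
set.seed(1)

# Hypothetical data: 50 observations of 5 variables
X <- matrix(rnorm(50 * 5), nrow = 50, ncol = 5)

# Make column 2 strongly correlated with column 1
X[, 2] <- X[, 1] + 0.3 * X[, 2]

# A sample correlation matrix is symmetric with a unit diagonal
R <- cor(X)

# Its eigenvalues are never negative (up to rounding error)
ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values
print(min(ev))
```

Note that when the data matrix has no more rows than columns, as with the 400 x 400 mat_Sim, the correlation matrix is singular (some eigenvalues are zero up to rounding), which mvrnorm accepts within its tolerance.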

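You can consider the following code, which uses nearPD() from the Matrix package to replace the covariance matrix with the nearest positive definite matrix: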

library(MASS)
library(Matrix)

sample_size <- 100      
sample_meanvector <- c(runif(200,0,1), runif(200,6,8))
sample_covariance_matrix <- matrix(NA, nrow = 400, ncol = 400)
diag(sample_covariance_matrix) <- 1
set.seed(666)
sample_covariance_matrix[lower.tri(sample_covariance_matrix)] <- runif(79800, 0.00001, 0.2)
sample_covariance_matrix[lower.tri(sample_covariance_matrix)][sample(1:79800, 10000)] <- runif(10000, 0.6, 0.9)
sample_covariance_matrix[upper.tri(sample_covariance_matrix)]<-t(sample_covariance_matrix)[upper.tri(sample_covariance_matrix)]
## nearPD() returns a Matrix-class object; convert it to a base matrix for mvrnorm
sample_covariance_matrix_Near_PD <- as.matrix(nearPD(sample_covariance_matrix)$mat)


## create multivariate normal distribution
sample_distribution <- mvrnorm(n = sample_size,
                               mu = sample_meanvector,
                               Sigma = sample_covariance_matrix_Near_PD)
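Because nearPD() changes the matrix, it can be worth checking how far the result is from what was requested. The sketch below does this on a hypothetical 3 x 3 matrix S (not the 400 x 400 matrix above); corr = TRUE is an assumption made here so that the diagonal stays at 1.

```r
library(Matrix)

# Hypothetical symmetric matrix that is not positive definite
S <- matrix(c( 1.0,  0.9,  0.9,
               0.9,  1.0, -0.9,
               0.9, -0.9,  1.0), nrow = 3, byrow = TRUE)

# Nearest positive definite correlation matrix, converted to a base matrix
fit <- nearPD(S, corr = TRUE)
S_pd <- as.matrix(fit$mat)

# Largest absolute change in any entry caused by the adjustment
max_change <- max(abs(S_pd - S))

# Smallest eigenvalue after the adjustment
min_ev_pd <- min(eigen(S_pd, symmetric = TRUE, only.values = TRUE)$values)

print(max_change)
print(min_ev_pd)
```

A large max_change means the simulated data will follow a correlation structure noticeably different from the one that was specified.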

In my question, I put high correlations and low correlations in the same covariance matrix, and the high correlations were then destroyed by the "find the nearest positive definite matrix" step. However, when I simulate the two parts separately, the high correlations are not affected much. Here is my updated code:

# load library MASS
library(MASS)
set.seed(666)  # added for reproducibility
sample_size <- 100

##### Separate high and low correlation parts ####
### High correlation ###
sample_mean_high <- runif(50, -1, 1)
sample_cov_high <- matrix(0, nrow = 50, ncol = 50)
diag(sample_cov_high) <- 1
sample_cov_high[lower.tri(sample_cov_high)] <- runif(1225, 0.6, 0.9)
# mirror the lower triangle into the upper triangle so the matrix is symmetric
sample_cov_high[upper.tri(sample_cov_high)] <- t(sample_cov_high)[upper.tri(sample_cov_high)]
# adjust to a positive definite matrix (requires the sfsmisc package)
sample_cov_high <- sfsmisc::posdefify(sample_cov_high)
# create multivariate normal distribution
sample_dist_high <- mvrnorm(n = sample_size, mu = sample_mean_high, Sigma = sample_cov_high)

### Low correlation ###
sample_mean_low <- runif(350, -1, 1)
sample_cov_low <- matrix(0, nrow = 350, ncol = 350)
diag(sample_cov_low) <- 1
sample_cov_low[lower.tri(sample_cov_low)] <- runif(61075, -0.5, 0.5)
# mirror the lower triangle into the upper triangle so the matrix is symmetric
sample_cov_low[upper.tri(sample_cov_low)] <- t(sample_cov_low)[upper.tri(sample_cov_low)]
sample_cov_low <- sfsmisc::posdefify(sample_cov_low)
# create multivariate normal distribution
sample_dist_low <- mvrnorm(n = sample_size, mu = sample_mean_low, Sigma = sample_cov_low)

### Check correlation ###
cor_high <- cor(sample_dist_high)
hist(cor_high[lower.tri(cor_high)])
cor_low <- cor(sample_dist_low)
hist(cor_low[lower.tri(cor_low)])
# cor_between is a 50 x 350 cross-correlation matrix, so plot all of its entries
cor_between <- cor(sample_dist_high, sample_dist_low)
hist(as.vector(cor_between))

With this, I get the correlations I expected.
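To end up with a single data set, the two independently simulated blocks can be bound together column-wise, which corresponds to a block-diagonal covariance matrix (zero covariance between the blocks). The sketch below is a scaled-down illustration under assumptions of my own: block_cov is a hypothetical helper that builds a constant-correlation matrix, which is swapped in here for the random fill plus posdefify step because it is positive definite by construction.

```r
library(MASS)
set.seed(42)
n <- 100

# Hypothetical helper: p x p matrix with 1 on the diagonal and rho elsewhere.
# This is positive definite for -1/(p - 1) < rho < 1.
block_cov <- function(p, rho) {
  m <- matrix(rho, nrow = p, ncol = p)
  diag(m) <- 1
  m
}

# Small stand-ins for the high and low correlation blocks
Sigma_high <- block_cov(5, 0.8)
Sigma_low  <- block_cov(15, 0.1)

# Simulate each block on its own, then bind the columns together
X_high <- mvrnorm(n = n, mu = rep(0, 5),  Sigma = Sigma_high)
X_low  <- mvrnorm(n = n, mu = rep(0, 15), Sigma = Sigma_low)
X <- cbind(X_high, X_low)

# Average within-block correlation versus average absolute between-block correlation
cor_high <- cor(X_high)
mean_within_high <- mean(cor_high[lower.tri(cor_high)])
mean_abs_between <- mean(abs(cor(X_high, X_low)))

print(dim(X))
print(mean_within_high)
print(mean_abs_between)
```

With the block sizes changed to 50 and 350 this gives the n = 100, p = 400 layout, at the cost of having no correlation between the two groups.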
