
How to generate correlated data for arbitrary distributions

The problem:

I want to be able to sample correlated data, defined by a given correlation matrix, from multiple distributions. I want to make as few assumptions about the distributions as possible; in particular, the approach should work for non-continuous and continuous distributions alike. (Mean 0 and variance 1 can of course be assumed.) The correlations do not need to be exact; a decent approximation is sufficient. I do not know the joint distribution, but if one could be constructed from the individual distributions and the correlation matrix, that would solve my problem.

What I tried:

I tried to draw from the distributions first and then search for rearrangements that matched the desired correlations better and better. Starting from the current best solution, I shuffled a decreasingly small part of the samples and checked whether the result was a better match; if so, it became my new starting point. But the results were rather disappointing, so there are probably better ways to do it.

My code (not providing the desired results):

import numpy as np
import scipy.stats  #needed explicitly: "import scipy" alone does not expose scipy.stats
import random

#the desired correlation matrix
des_cov1=np.array([[1,0.5,-0.2],[0.5,1,0.2],[-0.2,0.2,1]])

#data drawn from 3 different distributions
data1=np.array([np.random.normal(loc=-3.0, scale=1.0, size=1000), np.random.binomial(1000, 0.2, size=1000),np.random.normal(loc=5.0, scale=7.0, size=1000)])

#A helper function found on Stack Overflow to shuffle part of the data
#shuffle a given percentage of the data (on average)
def pashuffle(data, perc=10):
    for index, letter in enumerate(data):
        if random.randrange(0, 100) < perc/2:
            new_index = random.randrange(0, len(data))
            data[index], data[new_index] = data[new_index], data[index]
    return data

#Try to correlate the data according to a desired covariance matrix
def correlate_data_sets(data, des_cov, nr_trials=100, perc=10, decrease_factor=0.999):
    data=np.array(data, dtype=float)
    #save means and standard deviations to reconstruct the original data
    means= np.mean(data,axis=1)
    standard_deviation= scipy.stats.tstd(data, axis=1)
    for i in range(len(data)):
        bd_mean=means[i]
        bd_std=standard_deviation[i]
        #normalize data
        data[i]=[(j-bd_mean)/bd_std for j in data[i]]
    
    best_data=np.array(data)
    #initialize best deviation with a value that will for sure be beat
    best_deviation=4*len(des_cov)*len(des_cov)
    #try to find a good solution by starting with the current best
    for i in range(nr_trials):
        #shuffle part of the data to get a possibly better solution
        #(the last row stays fixed; only the relative order matters)
        for j in range(len(data)-1):
            #copy first: pashuffle shuffles in place and would otherwise
            #corrupt the current best arrangement
            data[j]=pashuffle(best_data[j].copy(),perc)
        #calculate the squared deviation
        deviation_matrix=np.cov(data)-des_cov
        squared_cov_deviation=sum(sum(deviation_matrix*deviation_matrix))
        #replace the current data arrangement if the new arrangement is better
        if squared_cov_deviation < best_deviation:
            best_data=data.copy()
            best_deviation=squared_cov_deviation
        #decrease the amount of data to shuffle
        perc*=decrease_factor
    
    correlations=np.cov(best_data)
    #transform data back
    for i in range(len(data)):
        bd_mean=means[i]
        bd_std=standard_deviation[i]
        best_data[i]=[(j*bd_std+bd_mean) for j in best_data[i]]
    print(correlations)

    return(best_data)

#Apply the function on the data
correlate_data_sets(data=data1, des_cov=des_cov1)

This is a tricky problem, but you can do it by (1) finding the Spearman rank correlation you need, (2) generating values from a uniform distribution with this pairwise correlation, and then (3) using the values from this sample as ranks in your arbitrary distributions to generate values from those distributions. See my paper using this technique at http://ee.hawaii.edu/~mfripp/papers/Fripp_2011_Wind_Reserves.pdf (section 2.2).

If you need more than the right pairwise rank correlation, you may be able to do it by generating uniformly distributed tuples (one element per random variable), then using some technique to nudge them into the right correlation structure, and then using them as ranks for the arbitrary distributions. That falls in the area of copula methods.
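The steps above can be sketched with a Gaussian copula, the simplest copula method. This is a minimal illustration, not the paper's exact procedure: draw correlated standard normals, push them through the normal CDF to get correlated uniforms, and feed those into each marginal's inverse CDF (`ppf`). For simplicity it targets the question's correlation matrix directly instead of first applying the exact Spearman-to-Pearson adjustment for a Gaussian copula (ρ = 2 sin(πρ_S/6)), which is a reasonable shortcut when the marginals are close to normal, as in the question's example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000

# Target pairwise correlation matrix (same as des_cov1 in the question).
target_corr = np.array([[1.0, 0.5, -0.2],
                        [0.5, 1.0,  0.2],
                        [-0.2, 0.2, 1.0]])

# Step 1: draw correlated standard normals.
z = rng.multivariate_normal(mean=np.zeros(3), cov=target_corr, size=n)

# Step 2: map them to correlated uniforms with the normal CDF
# (this is the Gaussian copula).
u = stats.norm.cdf(z)

# Step 3: feed the uniforms into the inverse CDF of each marginal,
# mirroring the three marginals used in the question.
x = np.column_stack([
    stats.norm.ppf(u[:, 0], loc=-3.0, scale=1.0),
    stats.binom.ppf(u[:, 1], n=1000, p=0.2),  # discrete marginals work too
    stats.norm.ppf(u[:, 2], loc=5.0, scale=7.0),
])

print(np.corrcoef(x, rowvar=False).round(2))
```

With 10,000 samples the empirical correlation matrix typically lands within a few hundredths of `target_corr`. For marginals far from normal (heavy-tailed or strongly discrete), the distortion introduced by skipping the rank-correlation adjustment grows, and pre-adjusting the normal correlation becomes worthwhile.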
