Multivariate normal PDF with NaN values in x

Is there an efficient implementation in Python to evaluate the PDF of a multivariate normal distribution when there are missing values in x? I guess the idea would be to reduce the dimensionality to the number of entries that are actually available in the vector whose density you are trying to evaluate. But I can't figure out whether the SciPy implementation has a way to ignore masked values.

For example,

from scipy.stats import multivariate_normal as mvnorm
import numpy as np

means = [0.0,0.0,0.0]
cov = np.array([[1.0,0.2,0.2],[0.2,1.0,0.2],[0.2,0.2,1.0]])
d = mvnorm(means,cov)
x = [0.5,-0.2,np.nan]
d.pdf(x)

yields output:

nan

(as expected)

Is there a way to efficiently evaluate the PDF using only the values that are present (in this case, effectively turning the 3D case into a bivariate one) with this implementation?

This question is a bit tricky in terms of both math and code. Let me elaborate.

First, the code. scipy.stats does not offer the NaN handling you want. Fast code likely requires implementing the multivariate normal PDF by hand and applying it to NumPy arrays directly; leveraging vectorization is the main route to offering this functionality efficiently for large-scale datasets. On the other hand, the NaN-tolerant function nanTol_pdf() below provides the desired functionality while staying true to the multivariate normal distribution as implemented in SciPy. You might find it sufficient for your use case.

import numpy as np
from scipy.stats import multivariate_normal as mvnorm

def nanTol_pdf(d, x):
    '''
    Return the density of the frozen multivariate normal d at x, using only
    the non-NaN entries of x (i.e. the marginal density over those entries).
    '''
    x = np.asarray(x, dtype=float)  # also accepts a plain list
    assert x.ndim == 1 and x.shape[0] == d.mean.shape[0]

    # check presence of nan entries
    if np.isnan(x).any():
        # indices of the observed (non-NaN) entries
        subIndex = np.flatnonzero(~np.isnan(x))
        if subIndex.size == 0:
            return np.nan  # nothing observed, so there is nothing to evaluate

        # lower-dimensional multivariate Gaussian distribution
        lowDim_mean = d.mean[subIndex]
        lowDim_cov  = d.cov[np.ix_(subIndex, subIndex)]  # taken from d, not from a global cov
        lowDim_d    = mvnorm(lowDim_mean, lowDim_cov)

        return lowDim_d.pdf(x[subIndex])
    else:
        return d.pdf(x)
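
If you need this for many rows at once, one possible approach is to group the rows by their pattern of missing entries and call SciPy once per pattern rather than once per row. The following is only a sketch under that assumption: the name nan_tolerant_pdf_batch and its signature are made up for illustration, and it is not tuned for performance.

```python
import numpy as np
from scipy.stats import multivariate_normal


def nan_tolerant_pdf_batch(mean, cov, X):
    """Density over the observed (non-NaN) entries of each row of X.

    Rows sharing a missingness pattern are evaluated in a single pdf call.
    Rows without any observed entry yield nan.
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    X = np.atleast_2d(np.asarray(X, dtype=float))
    observed = ~np.isnan(X)
    out = np.full(X.shape[0], np.nan)

    # Distinct patterns of observed columns, and which pattern each row has.
    patterns, inverse = np.unique(observed, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # keep it 1-D regardless of NumPy version

    for k, pattern in enumerate(patterns):
        if not pattern.any():
            continue  # nothing observed in these rows: leave nan
        rows = np.flatnonzero(inverse == k)
        cols = np.flatnonzero(pattern)
        sub = multivariate_normal(mean[cols], cov[np.ix_(cols, cols)])
        out[rows] = np.atleast_1d(sub.pdf(X[np.ix_(rows, cols)]))
    return out


mean = [0.0, 0.0, 0.0]
cov = [[1.0, 0.2, 0.2], [0.2, 1.0, 0.2], [0.2, 0.2, 1.0]]
X = [[0.5, -0.2, np.nan], [0.1, 0.2, 0.3], [np.nan, np.nan, np.nan]]
densities = nan_tolerant_pdf_batch(mean, cov, X)
```

The number of SciPy calls then scales with the number of distinct missingness patterns, not with the number of rows, which helps mostly when a few patterns dominate the data.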

Regardless, the fact that we can do it shouldn't stop us from asking whether we should.

Second, the math. Mathematically speaking, it is unclear what you are trying to achieve. In your example, SciPy returns nan because you query it with an ill-defined input vector x. An undefined output, i.e. not a number (nan), seems to be the most appropriate answer. Dropping the missing entries from both the distribution d and the input vector x (which amounts to evaluating the marginal density of the observed entries) circumvents the numerical problem but opens up statistical questions, in particular because probability density values cannot be interpreted as (conditional) probabilities. Moreover, the output alone does not reveal whether entries were dropped. Remember that nanTol_pdf() will happily return a non-negative real number as long as at least one entry in the vector is a real number. Your use case will decide whether this is reasonable.
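
To make concrete what the reduced evaluation returns, here is a small numerical check using the numbers from the question. It compares the density of the 2D sub-distribution with the 3D joint density integrated over the missing third coordinate. It is only meant as an illustration; the use of scipy.integrate.quad and the variable names are my own choices.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import multivariate_normal

mean = np.zeros(3)
cov = np.array([[1.0, 0.2, 0.2], [0.2, 1.0, 0.2], [0.2, 0.2, 1.0]])

joint = multivariate_normal(mean, cov)                 # full 3D distribution
reduced = multivariate_normal(mean[:2], cov[:2, :2])   # observed coordinates only

# Integrate the joint density over the unobserved third coordinate.
integral, _ = quad(lambda t: joint.pdf([0.5, -0.2, t]), -np.inf, np.inf)

# Density of the reduced distribution at the observed values.
marginal = reduced.pdf([0.5, -0.2])

print(integral, marginal)
```

The two numbers should agree up to integration error. Note that the result is a density over two coordinates, so it is not on the same scale as a density evaluated on a complete three-dimensional vector.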

Finally, I would suggest at least considering missing data imputation techniques before moving forward. Let me know if this helps.
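
As one simple example of imputation, assuming the mean and covariance are known and the data really are multivariate normal, the missing entries can be replaced by their conditional mean given the observed ones. This is a minimal sketch; the function name conditional_mean_impute is hypothetical, and other imputation techniques may suit your data better.

```python
import numpy as np


def conditional_mean_impute(mean, cov, x):
    """Fill NaN entries of x with E[x_missing | x_observed] under N(mean, cov)."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    x = np.array(x, dtype=float)  # work on a copy
    miss = np.isnan(x)
    obs = ~miss

    if not miss.any():
        return x            # nothing to impute
    if not obs.any():
        return mean.copy()  # nothing to condition on: fall back to the mean

    # mu_m + S_mo S_oo^{-1} (x_o - mu_o)
    S_oo = cov[np.ix_(obs, obs)]
    S_mo = cov[np.ix_(miss, obs)]
    x[miss] = mean[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mean[obs])
    return x


mean = [0.0, 0.0, 0.0]
cov = [[1.0, 0.2, 0.2], [0.2, 1.0, 0.2], [0.2, 0.2, 1.0]]
x_filled = conditional_mean_impute(mean, cov, [0.5, -0.2, np.nan])
```

Keep in mind that evaluating the full PDF on an imputed vector gives a different number than the reduced evaluation above, so the two approaches are not interchangeable.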
