
Using the scipy.stats library or another method to generate data that follows a distribution within specific bounds

I want to sample with the scipy.stats library, using an upper and a lower bound on the sampled data. I am interested in using scipy.stats.lognorm and scipy.stats.expon, setting a constraint (low<=x<=up) on the generated data points, and also evaluating logp subject to these limits. For instance, I cannot do:

LogNormal=scipy.stats.lognorm(q=[0,5],scale=[0.25],loc=0.0) #q:upper and lower limits, scale=sigma, loc=mean
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/vol/anaconda/lib/python2.7/site-packages/scipy/stats/_distn_infrastructure.py", line 739, in __call__
    return self.freeze(*args, **kwds)
  File "/vol/anaconda/lib/python2.7/site-packages/scipy/stats/_distn_infrastructure.py", line 736, in freeze
    return rv_frozen(self, *args, **kwds)
  File "/vol/anaconda/lib/python2.7/site-packages/scipy/stats/_distn_infrastructure.py", line 434, in __init__
    shapes, _, _ = self.dist._parse_args(*args, **kwds)
TypeError: _parse_args() got an unexpected keyword argument 'q'

The documentation is a bit confusing: which input parameter is sigma and which one is the mean? Could anybody give an example of how they should be set together with the boundaries?

There are several problems in your implementation:

1. Your pdf cannot be evaluated at x=0.

2. -log(1./sqrt(2*pi)/self.sigma*exp(-0.5*((log(value)-self.mu)/self.sigma)**2)) should be: -log(1./sqrt(2*pi)/self.sigma/value*exp(-0.5*((log(value)-self.mu)/self.sigma)**2))

(And there may be more.)

Another consideration is that you may want to keep the parameterization the same as scipy's to avoid future confusion.

Therefore, a minimal implementation:

In [112]:
import scipy.stats as ss
import scipy.optimize as so
import numpy as np

class bounded_distr(object):
    def __init__(self, parent_dist):
        self.parent = parent_dist
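    def bnd_lpdf(self, x, limits=None, *args, **kwargs):
        if limits and np.diff(limits)<=0:
            return -np.inf #nan may be better idea
        else:
            # negative log-density of the parent (not renormalized for the limits)
            _v = -np.log(self.parent.pdf(x, *args, **kwargs))
            if limits:
                # mask points outside the open interval (limits[0], limits[1])
                _v[x<=limits[0]] = -np.inf
                _v[x>=limits[1]] = -np.inf
            return _v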
    def bnd_cdf(self, x, limits=None, *args, **kwargs):
        if limits and np.diff(limits)<=0:
            return 0 #nan may be better idea
        elif limits:
            _v1 = self.parent.cdf(x, *args, **kwargs)
            _v2 = self.parent.cdf(limits[0], *args, **kwargs)
            _v3 = self.parent.cdf(limits[1], *args, **kwargs)
            _v4 = (_v1-_v2)/(_v3-_v2)
            _v4[_v4<0] = np.nan
            _v4[_v4>1] = np.nan
            return _v4
        else:
            return self.parent.cdf(x, *args, **kwargs)
    def bnd_rvs(self, size, limits=None, *args, **kwargs):
        if limits and np.diff(limits)<=0:
            return np.repeat(np.nan, size) #nan may be better idea
        elif limits:
            low, high = limits
            rnd_cdf = np.random.uniform(self.parent.cdf(x=low, *args, **kwargs),
                                        self.parent.cdf(x=high, *args, **kwargs),
                                        size=size)
            return self.parent.ppf(q=rnd_cdf, *args, **kwargs)
        else:
            return self.parent.rvs(size=size, *args, **kwargs)
In [113]:

bnd_logn = bounded_distr(ss.lognorm)
In [114]:

bnd_logn.bnd_rvs(10, limits=(0.1, 0.9), s=1, loc=0)
Out[114]:
array([ 0.23167598,  0.43185726,  0.34763109,  0.71020467,  0.5216074 ,
        0.60883528,  0.34353607,  0.84530444,  0.64145739,  0.82082447])
In [115]:

bnd_logn.bnd_lpdf(np.linspace(0,1,10), limits=(0.1, 0.9), s=1, loc=0)
Out[115]:
array([        inf,  1.13561188,  0.54598554,  0.42380072,  0.43681222,
        0.50389845,  0.5956744 ,  0.69920358,  0.80809192,  0.91893853])
In [116]:

bnd_logn.bnd_cdf(np.linspace(0,1,10), limits=(0.1, 0.9), s=1, loc=0)
Out[116]:
array([        nan,  0.00749028,  0.12434152,  0.28010562,  0.44267888,
        0.59832448,  0.74188947,  0.87201574,  0.98899161,         nan])
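
For comparison, a bounded log-density can also be renormalized by the probability mass that lies inside the limits, so that it integrates to one over [low, high]. The following is only a minimal sketch under stated assumptions, not the code above: `truncated_rvs` and `truncated_logpdf` are hypothetical helper names, the distribution is passed as a frozen scipy.stats object, and the log-density is returned with a positive sign convention (log p, with -inf outside the limits).

```python
import numpy as np
import scipy.stats as ss

def truncated_rvs(dist, low, high, size, rng=None):
    """Inverse-CDF sampling restricted to [low, high] (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    # draw uniforms only between the CDF values at the two limits
    u = rng.uniform(dist.cdf(low), dist.cdf(high), size=size)
    return dist.ppf(u)

def truncated_logpdf(dist, x, low, high):
    """log p(x | low <= x <= high); -inf outside the limits (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # log of the probability mass the parent distribution puts inside the limits
    log_mass = np.log(dist.cdf(high) - dist.cdf(low))
    inside = (x >= low) & (x <= high)
    out = np.full(x.shape, -np.inf)
    out[inside] = dist.logpdf(x[inside]) - log_mass
    return out

# frozen lognormal with sigma = 1 and mu = log(scale) = 0
dist = ss.lognorm(s=1.0, loc=0.0, scale=1.0)
samples = truncated_rvs(dist, 0.1, 0.9, size=1000, rng=np.random.default_rng(0))
logp = truncated_logpdf(dist, [0.05, 0.5, 0.95], 0.1, 0.9)
```

Drawing the uniforms between the two CDF values avoids any rejection step, which is the same idea as `bnd_rvs` above.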

I could finally write two prior classes, which can also sample data from the given distribution within the given limits. I used inverse transform sampling to draw the samples. My classes are as follows:

import os, sys
import logging
import scipy.stats
import numpy as np  # the classes below use np.empty, np.empty_like and np.random
from numpy import exp, sqrt, log, isfinite, inf, pi
import scipy.special
import scipy.optimize
class LogPrior(object):
    def eval(self, value):
        return 0.
    def __call__(self, value):
        return self.eval(value)
    def sample(self, n=None):
        """ Sample from this prior. The returned array axis=0 is the
            sample axis.

            Parameters
            ----------
            n : int (optional)
                Number of samples to draw
        """
        raise ValueError("Cannot sample from a LogPrior object.")
    def __str__(self):
        return "<LogPrior>"
    def __repr__(self):
        return self.__str__()

Update: the class for the lognormal distribution:

class LognormalPrior(LogPrior):
    """
    Log-normal log-likelihood.

    Distribution of any random variable whose logarithm is normally
    distributed. A variable might be modeled as log-normal if it can
    be thought of as the multiplicative product of many small
    independent factors.

    .. math::
        f(x \mid \mu, \tau) = \sqrt{\frac{\tau}{2\pi}}\frac{
        \exp\left\{ -\frac{\tau}{2} (\ln(x)-\mu)^2 \right\}}{x}

    :Parameters:
      - `x` : x > 0
      - `mu` : Location parameter.
      - `tau` : Scale parameter (tau > 0).

    .. note::

       :math:`E(X)=e^{\mu+\frac{1}{2\tau}}`
       :math:`Var(X)=(e^{1/\tau}-1)e^{2\mu+\frac{1}{\tau}}`

    """
    def __init__(self, mu, tau, *args, **kwargs):
        super(LognormalPrior, self).__init__(*args, **kwargs)
        self.mu = mu
        self.tau = tau
        self.mean = exp(mu + 1./(2*tau))
        self.median = exp(mu)
        self.mode = exp(mu - 1./tau)
        self.variance = (exp(1./tau) - 1) * exp(2*mu + 1./tau)
        self.sigma=1./sqrt(tau)
    def logp(self, value, limits=None):
        """Log of lognormal prior probability, optionally with hard limits."""
        # return log(pdf) so that in-bounds values and the -inf returned
        # out of bounds share one sign convention (not renormalized for the limits)
        lp = -log(sqrt(2*pi)*self.sigma*value) - 0.5*((log(value)-self.mu)/self.sigma)**2
        if limits:
           lower,upper=limits
           if value >= lower and value <= upper:
              return lp
           else:
              return -inf
        else:
           return lp
    #Cumulative distribution function of lognormal distribution 
    def cdf(self, value):
       if not isinstance(value, float):
          res=np.empty_like(value)
          for i in range(res.shape[0]):
              if value[i]==0.0:
                 res[i]=0.0
              else:
                 res[i]=0.5+0.5*scipy.special.erf((log(value[i])-self.mu)/(sqrt(2)*self.sigma))
          return res
       else:
          if value==0.0:
             return 0.0
          else:
             return 0.5+0.5*scipy.special.erf((log(value)-self.mu)/(sqrt(2)*self.sigma))

    #sampling data with the given distribution
    def sample(self, n, limits=None):
        res=np.empty(n)
        if limits:
           lower,upper=limits
           j=0
           while (j<n):
               # draw u once per sample so that f is deterministic for the root finder
               u=np.random.uniform(low=0,high=1)
               def f(x):
                   return self.cdf(x)-u
               # the bracket [0, 20] assumes nearly all of the mass lies below 20
               s=scipy.optimize.brenth(f,0,20)
               # keep only samples that fall inside the limits
               if s >= lower and s <= upper:
                  res[j]=s
                  j+=1
        else:
           r=np.random.uniform(low=0,high=1,size=n)
           for j in range(n):
               def f(x):
                   return self.cdf(x)-r[j]
               s=scipy.optimize.brenth(f,0,20)
               res[j]=s
        return res

The class for the exponential distribution:

class ExponentialPrior(LogPrior):
    """
    Exponential distribution

    Parameters
    ----------
    lam : float
        lam > 0
        rate or inverse scale
    """
    def __init__(self, lam, *args, **kwargs):
        super(ExponentialPrior, self).__init__(*args, **kwargs)
        self.lam = lam
        self.mean = 1. / lam
        self.median = self.mean * log(2)
        self.mode = 0
        self.variance = lam ** -2
    def logp(self, value, limits=None):
        """Log of exponential prior probability, optionally with hard limits."""
        # return log(pdf) = log(lam) - lam*value so that in-bounds values and the
        # -inf returned out of bounds share one sign convention
        lp = log(self.lam)-self.lam*value
        if limits:
           lower,upper=limits
           if value >= lower and value <= upper:
              return lp
           else:
              return -inf
        else:
           return lp
    def cdf(self, value):
        """Cumulative distribution function of the exponential distribution"""
        return (1-exp(-self.lam*value))
    #sampling data with the given distribution
    def sample(self, n, limits=None):
        res=np.empty(n)
        if limits:
           lower,upper=limits
           j=0
           while (j<n):
               # draw u once per sample so that f is deterministic for the root finder
               u=np.random.uniform(low=0,high=1)
               def f(x):
                   return self.cdf(x)-u
               # the bracket [0, 100] assumes nearly all of the mass lies below 100
               s=scipy.optimize.brenth(f,0,100)
               # keep only samples that fall inside the limits
               if s >= lower and s <= upper:
                  res[j]=s
                  j+=1
        else:
           r=np.random.uniform(low=0,high=1,size=n)
           for j in range(n):
               def f(x):
                   return self.cdf(x)-r[j]
               s=scipy.optimize.brenth(f,0,100)
               res[j]=s
        return res

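
For the exponential distribution the CDF F(x) = 1 - exp(-lam*x) can be inverted in closed form, so neither a root finder nor a rejection loop is strictly needed. The following is a minimal sketch under that assumption, not a replacement for the class above: `sample_truncated_expon` is a hypothetical function name, and the uniforms are drawn between the CDF values at the two limits so that every sample lands inside them.

```python
import numpy as np

def sample_truncated_expon(lam, lower, upper, n, rng=None):
    """Draw n samples from an exponential(rate=lam) restricted to [lower, upper]."""
    rng = np.random.default_rng() if rng is None else rng
    # CDF of the untruncated exponential at the two limits
    f_low = 1.0 - np.exp(-lam * lower)
    f_up = 1.0 - np.exp(-lam * upper)
    # uniforms confined to [F(lower), F(upper)]
    u = rng.uniform(f_low, f_up, size=n)
    # closed-form inverse of F: x = -log(1 - u) / lam
    return -np.log1p(-u) / lam

draws = sample_truncated_expon(lam=2.0, lower=0.5, upper=3.0, n=5000,
                               rng=np.random.default_rng(1))
```
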
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import lognorm

mean = 4.0 # Geometric mean == median
standard_deviation = 2.0 # Geometric standard deviation
sigma = np.log(standard_deviation) # Standard deviation of log(X)
x = np.linspace(0.1, 25, num=400) # values for x-axis
pdf = lognorm.pdf(x, sigma, loc=0, scale=mean) # probability distribution
plt.plot(x,pdf)
plt.show()

[Figure: plot of the lognormal density]
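
On the question of which argument is sigma and which is the mean: in scipy.stats.lognorm the shape parameter `s` is the standard deviation of log(X), and `scale` is exp(mu), where mu is the mean of log(X). The sketch below checks this numerically against the explicit log-density formula with the 1/x factor; the values of `mu` and `sigma` are chosen only for illustration and mirror the snippet above.

```python
import numpy as np
from scipy.stats import lognorm

mu = np.log(4.0)      # mean of log(X); exp(mu) is the median of X
sigma = np.log(2.0)   # standard deviation of log(X)

# scipy parameterization: s = sigma, scale = exp(mu), loc = 0
dist = lognorm(s=sigma, loc=0, scale=np.exp(mu))

x = np.array([0.5, 1.0, 4.0, 10.0])
# explicit lognormal log-density, including the 1/x factor
explicit = -np.log(x * sigma * np.sqrt(2 * np.pi)) - 0.5 * ((np.log(x) - mu) / sigma) ** 2

matches = np.allclose(dist.logpdf(x), explicit)
median = dist.median()  # equals exp(mu)
```
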
