Generating data that follows a distribution within given bounds, using the scipy.stats library or another method

Question

I want to sample with the scipy.stats library, using an upper and a lower bound for the sampled data. I am interested in using scipy.stats.lognorm and scipy.stats.expon, setting a constraint (low<=x<=up) on the generated data points, and also evaluating logp with these limits taken into account. For instance, this does not work:

LogNormal=scipy.stats.lognorm(q=[0,5],scale=[0.25],loc=0.0) #q:upper and lower limits, scale=sigma, loc=mean
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/vol/anaconda/lib/python2.7/site-packages/scipy/stats/_distn_infrastructure.py", line 739, in __call__
    return self.freeze(*args, **kwds)
  File "/vol/anaconda/lib/python2.7/site-packages/scipy/stats/_distn_infrastructure.py", line 736, in freeze
    return rv_frozen(self, *args, **kwds)
  File "/vol/anaconda/lib/python2.7/site-packages/scipy/stats/_distn_infrastructure.py", line 434, in __init__
    shapes, _, _ = self.dist._parse_args(*args, **kwds)
TypeError: _parse_args() got an unexpected keyword argument 'q'

The documentation is a bit confusing: which input parameter is sigma and which is the mean? Could anybody give an example of how they should be set, together with the boundaries?

Answer 1

There are several problems in your implementation:

1, your pdf cannot be evaluated at x=0

2, -log(1./sqrt(2*pi)/self.sigma*exp(-0.5*((log(value)-self.mu)/self.sigma)**2)) should be: -log(1./sqrt(2*pi)/self.sigma/value*exp(-0.5*((log(value)-self.mu)/self.sigma)**2))

(And there may be more)
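
Point 2 can be checked numerically. The sketch below compares the corrected expression (with the extra 1/value factor) against scipy's own log-density. It assumes mu and sigma are the mean and standard deviation of log(x), which correspond to s=sigma and scale=exp(mu) in scipy.stats.lognorm; the helper name neg_log_lognorm_pdf and the test values are only illustrative.

```python
import numpy as np
import scipy.stats as ss

def neg_log_lognorm_pdf(value, mu, sigma):
    # Corrected expression from point 2, including the 1/value factor.
    return -np.log(1. / np.sqrt(2 * np.pi) / sigma / value
                   * np.exp(-0.5 * ((np.log(value) - mu) / sigma) ** 2))

mu, sigma = 0.3, 0.25          # illustrative values, not from the question
value = np.array([0.5, 1.0, 2.0, 4.0])

# scipy parameterization (assumed mapping): s = sigma, scale = exp(mu), loc = 0
reference = -ss.lognorm.logpdf(value, s=sigma, loc=0, scale=np.exp(mu))
print(np.allclose(neg_log_lognorm_pdf(value, mu, sigma), reference))
```

Without the 1/value factor the two results would differ by log(value), so the comparison would only agree at value=1.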

Another consideration is that you may want to keep the parameterization the same as scipy's, to avoid future confusion.

Therefore, a minimal implementation:

In [112]:
import scipy.stats as ss
import scipy.optimize as so
import numpy as np

class bounded_distr(object):
    def __init__(self, parent_dist):
        self.parent = parent_dist
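    def bnd_lpdf(self, x, limits=None, *args, **kwargs):
        # Negative log-density of the parent, following the question's logp;
        # it is not renormalized for the truncation.
        if limits and np.diff(limits)<=0:
            return -np.inf #nan may be better idea
        _v = -np.log(self.parent.pdf(x, *args, **kwargs))
        if limits:
            _v[x<=limits[0]] = -np.inf
            _v[x>=limits[1]] = -np.inf
        return _v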
    def bnd_cdf(self, x, limits=None, *args, **kwargs):
        if limits and np.diff(limits)<=0:
            return 0 #nan may be better idea
        elif limits:
            _v1 = self.parent.cdf(x, *args, **kwargs)
            _v2 = self.parent.cdf(limits[0], *args, **kwargs)
            _v3 = self.parent.cdf(limits[1], *args, **kwargs)
            _v4 = (_v1-_v2)/(_v3-_v2)
            _v4[_v4<0] = np.nan
            _v4[_v4>1] = np.nan
            return _v4
        else:
            return self.parent.cdf(x, *args, **kwargs)
    def bnd_rvs(self, size, limits=None, *args, **kwargs):
        if limits and np.diff(limits)<=0:
            return np.repeat(np.nan, size) #nan may be better idea
        elif limits:
            low, high = limits
            rnd_cdf = np.random.uniform(self.parent.cdf(x=low, *args, **kwargs),
                                        self.parent.cdf(x=high, *args, **kwargs),
                                        size=size)
            return self.parent.ppf(q=rnd_cdf, *args, **kwargs)
        else:
            return self.parent.rvs(size=size, *args, **kwargs)
In [113]:

bnd_logn = bounded_distr(ss.lognorm)
In [114]:

bnd_logn.bnd_rvs(10, limits=(0.1, 0.9), s=1, loc=0)
Out[114]:
array([ 0.23167598,  0.43185726,  0.34763109,  0.71020467,  0.5216074 ,
        0.60883528,  0.34353607,  0.84530444,  0.64145739,  0.82082447])
In [115]:

bnd_logn.bnd_lpdf(np.linspace(0,1,10), limits=(0.1, 0.9), s=1, loc=0)
Out[115]:
array([        inf,  1.13561188,  0.54598554,  0.42380072,  0.43681222,
        0.50389845,  0.5956744 ,  0.69920358,  0.80809192,  0.91893853])
In [116]:

bnd_logn.bnd_cdf(np.linspace(0,1,10), limits=(0.1, 0.9), s=1, loc=0)
Out[116]:
array([        nan,  0.00749028,  0.12434152,  0.28010562,  0.44267888,
        0.59832448,  0.74188947,  0.87201574,  0.98899161,         nan])
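
Note that bnd_lpdf evaluates the parent density without rescaling it to the interval. If a properly normalized truncated density is wanted, one option is to subtract the log of the probability mass that the parent puts inside the limits. A minimal sketch, assuming a frozen scipy distribution; truncated_logpdf is a hypothetical helper, not part of scipy:

```python
import numpy as np
import scipy.stats as ss
from scipy.integrate import quad

def truncated_logpdf(dist, x, low, high):
    """Log-density of a frozen scipy distribution `dist` truncated to [low, high]."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # Probability mass the parent distribution puts inside the limits.
    mass = dist.cdf(high) - dist.cdf(low)
    inside = (x >= low) & (x <= high)
    out = np.full(x.shape, -np.inf)   # log-density is -inf outside the limits
    out[inside] = dist.logpdf(x[inside]) - np.log(mass)
    return out

parent = ss.lognorm(s=1, loc=0, scale=1)
low, high = 0.1, 0.9

# The truncated density should integrate to (approximately) 1 over the limits.
total, _ = quad(lambda t: np.exp(truncated_logpdf(parent, t, low, high)[0]), low, high)
print(total)
```

This returns the log-density itself (not its negative), so -inf outside the limits means zero probability there.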

Answer 2

I was finally able to write two prior classes, which can also sample data from the given distribution within the given limits. I used the inverse sampling method to sample the data. My classes are as follows:

import os, sys
import logging
import scipy.stats
import numpy as np
from numpy import exp, sqrt, log, isfinite, inf, pi
import scipy.special
import scipy.optimize
class LogPrior(object):
    def eval(self, value):
        return 0.
    def __call__(self, value):
        return self.eval(value)
    def sample(self, n=None):
        """ Sample from this prior. The returned array axis=0 is the
            sample axis.

            Parameters
            ----------
            n : int (optional)
                Number of samples to draw
        """
        raise ValueError("Cannot sample from a LogPrior object.")
    def __str__(self):
        return "<LogPrior>"
    def __repr__(self):
        return self.__str__()

Update: the class for the log-normal distribution:

class LognormalPrior(LogPrior):
    """
    Log-normal log-likelihood.

    Distribution of any random variable whose logarithm is normally
    distributed. A variable might be modeled as log-normal if it can
    be thought of as the multiplicative product of many small
    independent factors.

    .. math::
        f(x \mid \mu, \tau) = \sqrt{\frac{\tau}{2\pi}}\frac{
        \exp\left\{ -\frac{\tau}{2} (\ln(x)-\mu)^2 \right\}}{x}

    :Parameters:
      - `x` : x > 0
      - `mu` : Location parameter.
      - `tau` : Scale parameter (tau > 0).

    .. note::

       :math:`E(X)=e^{\mu+\frac{1}{2\tau}}`
       :math:`Var(X)=(e^{1/\tau}-1)e^{2\mu+\frac{1}{\tau}}`

    """
    def __init__(self, mu, tau, *args, **kwargs):
        super(LognormalPrior, self).__init__(*args, **kwargs)
        self.mu = mu
        self.tau = tau
        self.mean = exp(mu + 1./(2*tau))
        self.median = exp(mu)
        self.mode = exp(mu - 1./tau)
        self.variance = (exp(1./tau) - 1) * exp(2*mu + 1./tau)
        self.sigma=1./sqrt(tau)
    def logp(self, value, limits=None):
        # Negative log of the lognormal density; -inf outside the hard limits.
        if limits:
            lower,upper=limits
            if not (value >= lower and value <= upper):
                return -inf
        return -log(1./sqrt(2*pi)/value/self.sigma*exp(-0.5*((log(value)-self.mu)/self.sigma)**2))
    #Cumulative distribution function of lognormal distribution 
    def cdf(self, value):
       if not isinstance(value, float):
          res=np.empty_like(value)
          for i in range(res.shape[0]):
              if value[i]==0.0:
                 res[i]=0.0
              else:
                 res[i]=0.5+0.5*scipy.special.erf((log(value[i])-self.mu)/(sqrt(2)*self.sigma))
          return res
       else:
          if value==0.0:
             return 0.0
          else:
             return 0.5+0.5*scipy.special.erf((log(value)-self.mu)/(sqrt(2)*self.sigma))

    #sampling data with the given distribution    
    def sample(self, n, limits=None):
        res=np.empty(n)
        if limits:
            lower,upper=limits
            j=0
            while (j<n):
                # draw u once, so the root-finder sees a fixed target
                u=np.random.uniform(low=0,high=1)
                def f(x):
                    return self.cdf(x)-u
                s=scipy.optimize.brenth(f,0,20)
                # keep only the draws that fall inside the limits
                if s >= lower and s <= upper:
                    res[j]=s
                    j+=1
        else:
            r=np.random.uniform(low=0,high=1,size=n)
            for j in range(n):
                def f(x):
                    return self.cdf(x)-r[j]
                s=scipy.optimize.brenth(f,0,20)
                res[j]=s
        return res

The class for the exponential distribution:

class ExponentialPrior(LogPrior):
    """
    Exponential distribution

    Parameters
    ----------
    lam : float
        lam > 0
        rate or inverse scale
    """
    def __init__(self, lam, *args, **kwargs):
        super(ExponentialPrior, self).__init__(*args, **kwargs)
        self.lam = lam
        self.mean = 1. / lam
        self.median = self.mean * log(2)
        self.mode = 0
        self.variance = lam ** -2
    def logp(self, value, limits=None):
        # Negative log of the exponential density; -inf outside the hard limits.
        if limits:
            lower,upper=limits
            if not (value >= lower and value <= upper):
                return -inf
        return -log(self.lam)+self.lam*value
    def cdf(self, value):
        """Cumulative distribution function of the exponential distribution"""
        return (1-exp(-self.lam*value))
    #sampling data with the given distribution    
    def sample(self, n, limits=None):
        res=np.empty(n)
        if limits:
            lower,upper=limits
            j=0
            while (j<n):
                # draw u once, so the root-finder sees a fixed target
                u=np.random.uniform(low=0,high=1)
                def f(x):
                    return self.cdf(x)-u
                s=scipy.optimize.brenth(f,0,100)
                # keep only the draws that fall inside the limits
                if s >= lower and s <= upper:
                    res[j]=s
                    j+=1
        else:
            r=np.random.uniform(low=0,high=1,size=n)
            for j in range(n):
                def f(x):
                    return self.cdf(x)-r[j]
                s=scipy.optimize.brenth(f,0,100)
                res[j]=s
        return res
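
For the exponential case the CDF can be inverted in closed form, so the root-finder and the rejection loop are not strictly needed. A minimal sketch of inverse sampling restricted to [lower, upper], assuming lam is the rate as in ExponentialPrior; sample_truncated_expon is a hypothetical helper, not part of the classes above:

```python
import numpy as np

def sample_truncated_expon(lam, n, lower, upper, rng=None):
    """Draw n samples from an exponential with rate lam, restricted to [lower, upper]."""
    rng = np.random.default_rng() if rng is None else rng
    # CDF of the untruncated exponential at both limits.
    c_low = 1.0 - np.exp(-lam * lower)
    c_high = 1.0 - np.exp(-lam * upper)
    # Uniform draws on [F(lower), F(upper)], then invert F(x) = 1 - exp(-lam*x).
    u = rng.uniform(c_low, c_high, size=n)
    return -np.log1p(-u) / lam

samples = sample_truncated_expon(lam=2.0, n=5, lower=0.5, upper=3.0,
                                 rng=np.random.default_rng(0))
print(samples)
```

Every uniform draw maps to a point inside the limits, so no samples are discarded, which matters when the limits cover only a small part of the distribution.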

Answer 3

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import lognorm

mean = 4.0 # Geometric mean == median
standard_deviation = 2.0 # Geometric standard deviation
sigma = np.log(standard_deviation) # Standard deviation of log(X)
x = np.linspace(0.1, 25, num=400) # values for x-axis
pdf = lognorm.pdf(x, sigma, loc=0, scale=mean) # probability distribution
plt.plot(x,pdf)
plt.show()

[Figure: log-normal plot]
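
The parameterization used in this answer (scale is the geometric mean, i.e. the median; the shape is the log of the geometric standard deviation) can be sanity-checked by sampling. A small sketch; the sample size and the seed are arbitrary choices:

```python
import numpy as np
from scipy.stats import lognorm

geometric_mean = 4.0
geometric_sd = 2.0
sigma = np.log(geometric_sd)  # standard deviation of log(X)

dist = lognorm(sigma, loc=0, scale=geometric_mean)
samples = dist.rvs(size=100000, random_state=0)

print(dist.median())                   # analytic median, equal to the scale
print(np.exp(np.log(samples).mean()))  # sample geometric mean, close to 4
print(np.exp(np.log(samples).std()))   # sample geometric SD, close to 2
```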

Answer 1
1 Accepted 2014-08-06 15:42:54

Answer 2
0 2014-08-05 00:50:00

Answer 3
-1 2014-08-04 19:17:51
