
Fitting a Weibull distribution using Scipy

I am trying to recreate maximum likelihood distribution fitting. I can already do this in Matlab and R, but now I want to use scipy. In particular, I would like to estimate the Weibull distribution parameters for my data set.

I have tried this:

import scipy.stats as s
import numpy as np
import matplotlib.pyplot as plt

def weib(x,n,a):
    return (a / n) * (x / n)**(a - 1) * np.exp(-(x / n)**a)

data = np.loadtxt("stack_data.csv")

(loc, scale) = s.exponweib.fit_loc_scale(data, 1, 1)
print(loc, scale)

x = np.linspace(data.min(), data.max(), 1000)
plt.plot(x, weib(x, loc, scale))
plt.hist(data, int(data.max()), density=True)
plt.show()

And get this:

(2.5827280639441961, 3.4955032285727947)

And a distribution that looks like this:

[Image: Weibull distribution using Scipy]

I have been using exponweib after reading this: http://www.johndcook.com/distributions_scipy.html. I have also tried the other Weibull functions in scipy (just in case!).

In Matlab (using the Distribution Fitting Tool - see screenshot) and in R (using both the MASS library function fitdistr and the GAMLSS package) I get a (loc) and b (scale) parameters more like 1.58463497 5.93030013. I believe all three methods use the maximum likelihood method for distribution fitting.

[Image: Weibull distribution using Matlab]

I have posted my data here if you would like to have a go. And for completeness, I am using Python 2.7.5, Scipy 0.12.0, R 2.15.2 and Matlab 2012b.

Why am I getting a different result?

My guess is that you want to estimate the shape parameter and the scale of the Weibull distribution while keeping the location fixed. Fixing loc assumes that the values of your data and of the distribution are positive with lower bound at zero.

floc=0 keeps the location fixed at zero, f0=1 keeps the first shape parameter of the exponentiated Weibull fixed at one.

>>> stats.exponweib.fit(data, floc=0, f0=1)
[1, 1.8553346917584836, 0, 6.8820748596850905]
>>> stats.weibull_min.fit(data, floc=0)
[1.8553346917584836, 0, 6.8820748596850549]
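
As a quick check of the two calls above, here is a minimal sketch on synthetic data. The sample size, seed, and true parameters are assumptions chosen for illustration; this is not the asker's data. With the location fixed at zero (and the first exponweib shape fixed at one), both fits should land on nearly the same shape and scale.

```python
import numpy as np
from scipy import stats

# Synthetic data: two-parameter Weibull with assumed shape 1.8 and scale 6.9.
true_shape, true_scale = 1.8, 6.9
sample = stats.weibull_min.rvs(true_shape, loc=0, scale=true_scale,
                               size=5000, random_state=0)

# exponweib returns (a, c, loc, scale); a and loc are held fixed here.
a_hat, c_hat, loc_ew, scale_ew = stats.exponweib.fit(sample, floc=0, f0=1)

# weibull_min returns (c, loc, scale); only loc is held fixed.
shape_wm, loc_wm, scale_wm = stats.weibull_min.fit(sample, floc=0)

print("exponweib  :", c_hat, scale_ew)
print("weibull_min:", shape_wm, scale_wm)
```

Small differences in the trailing digits are expected, since both results come from a numerical optimizer.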

The fit compared to the histogram looks OK, but not very good. The parameter estimates are a bit higher than the ones you mention from R and Matlab.

Update

The closest I can get to the plot that is now available is with an unrestricted fit, but using starting values. The plot is still less peaked. Note that values in fit that don't have an f in front are used as starting values.

>>> import numpy as np
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
>>> plt.plot(data, stats.exponweib.pdf(data, *stats.exponweib.fit(data, 1, 1, scale=2, loc=0)))
>>> _ = plt.hist(data, bins=np.linspace(0, 16, 33), density=True, alpha=0.5)
>>> plt.show()
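
The difference between a starting value (loc=...) and a fixed value (floc=...) can be sketched as follows. This is an illustrative example on synthetic data with assumed parameters and seed, and it uses weibull_min rather than exponweib to keep the number of free parameters small.

```python
from scipy import stats

# Synthetic two-parameter Weibull sample (assumed shape 1.8, scale 6.9).
sample = stats.weibull_min.rvs(1.8, loc=0, scale=6.9, size=2000, random_state=0)

# loc=0 is only an initial guess: the optimizer is free to move it.
shape_free, loc_free, scale_free = stats.weibull_min.fit(sample, loc=0)

# floc=0 is a constraint: the returned location is exactly the fixed value.
shape_fixed, loc_fixed, scale_fixed = stats.weibull_min.fit(sample, floc=0)

print("start at loc=0:", shape_free, loc_free, scale_free)
print("fixed floc=0  :", shape_fixed, loc_fixed, scale_fixed)
```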

[Image: exponweib fit]

It is easy to verify which result is the true MLE; you just need a simple function to calculate the log-likelihood:

>>> from numpy import array, log, loadtxt
>>> from scipy import stats
>>> def wb2LL(p, x):  # log-likelihood; p = (scale, shape)
...     return sum(log(stats.weibull_min.pdf(x, p[1], 0., p[0])))
>>> adata = loadtxt('/home/user/stack_data.csv')
>>> wb2LL(array([6.8820748596850905, 1.8553346917584836]), adata)
-8290.1227946678173
>>> wb2LL(array([5.93030013, 1.57463497]), adata)
-8410.3327470347667
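
The same comparison can be sketched on synthetic data (assumed parameters and seed, not the asker's file). This variant uses logpdf instead of log(pdf), which avoids taking the log of values that underflow to zero; the helper name weibull_loglik is hypothetical.

```python
import numpy as np
from scipy import stats

def weibull_loglik(params, x):
    """Log-likelihood of a two-parameter Weibull; params = (scale, shape)."""
    scale, shape = params
    return np.sum(stats.weibull_min.logpdf(x, shape, loc=0, scale=scale))

# Synthetic sample with assumed shape 1.8 and scale 6.9.
x = stats.weibull_min.rvs(1.8, loc=0, scale=6.9, size=5000, random_state=1)

# Maximum likelihood estimate with the location fixed at zero.
shape_hat, _, scale_hat = stats.weibull_min.fit(x, floc=0)

ll_mle = weibull_loglik((scale_hat, shape_hat), x)
# Any other candidate pair, here the estimates shrunk by 15%, should score lower.
ll_other = weibull_loglik((0.85 * scale_hat, 0.85 * shape_hat), x)

print(ll_mle, ll_other)
```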

The result from the fit method of exponweib and R fitdistr (@Warren) is better and has a higher log-likelihood. It is more likely to be the true MLE. It is not surprising that the result from GAMLSS is different. It is a completely different statistical model: Generalized Additive Model.

Still not convinced? We can draw a 2D confidence limit plot around the MLE (see Meeker and Escobar's book for details).

[Image: multidimensional confidence region]

Again this verifies that array([6.8820748596850905, 1.8553346917584836]) is the right answer, as the log-likelihood is lower at any other point in the parameter space. Note:

>>> log(array([6.8820748596850905, 1.8553346917584836]))
array([ 1.92892018,  0.61806511])
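
One common way to build such a region is from the likelihood ratio: keep every (shape, scale) pair whose log-likelihood is within half a chi-square quantile (2 degrees of freedom) of the maximum. The sketch below does this on a grid for synthetic data. The data, grid limits, and 95% level are assumptions, and it is not necessarily how the plot in this answer was produced.

```python
import numpy as np
from scipy import stats

# Synthetic sample with assumed shape 1.8 and scale 6.9.
data = stats.weibull_min.rvs(1.8, loc=0, scale=6.9, size=2000, random_state=0)

def loglik(shape, scale):
    return np.sum(stats.weibull_min.logpdf(data, shape, loc=0, scale=scale))

shape_hat, _, scale_hat = stats.weibull_min.fit(data, floc=0)
ll_hat = loglik(shape_hat, scale_hat)

# Evaluate the log-likelihood on a grid of +/-20% around the estimates.
shapes = np.linspace(0.8 * shape_hat, 1.2 * shape_hat, 41)
scales = np.linspace(0.8 * scale_hat, 1.2 * scale_hat, 41)
ll_grid = np.array([[loglik(k, s) for s in scales] for k in shapes])

# Likelihood-ratio region: 2 * (ll_hat - ll) <= chi-square quantile, 2 dof.
threshold = stats.chi2.ppf(0.95, df=2)
inside = 2.0 * (ll_hat - ll_grid) <= threshold

print("grid points inside the 95% region:", int(inside.sum()))
```

Passing shapes, scales, and 2.0 * (ll_hat - ll_grid) to a contour plot with the threshold as the level would draw the boundary of the region.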

BTW1, the MLE fit may not appear to fit the distribution histogram tightly. An easy way to think about the MLE is that it is the parameter estimate under which the observed data are most probable. It doesn't need to fit the histogram well visually; that would be something that minimizes mean squared error.

BTW2, your data appears to be leptokurtic and left-skewed, which means the Weibull distribution may not fit your data well. Try, e.g., Gompertz-Logistic, which improves the log-likelihood by about another 100. Cheers!

I know it's an old post, but I just faced a similar problem and this thread helped me solve it. I thought my solution might be helpful for others like me:

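import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Fit Weibull function, some explanation below
# (data is assumed to be a 1-D array of positive values, loaded beforehand)
params = stats.exponweib.fit(data, floc=0, f0=1)
shape = params[1]
scale = params[3]
print('shape:', shape)
print('scale:', scale)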

#### Plotting
# Histogram first
values,bins,hist = plt.hist(data,bins=51,range=(0,25),density=True)
center = (bins[:-1] + bins[1:]) / 2.

# Using all params and the stats function
plt.plot(center,stats.exponweib.pdf(center,*params),lw=4,label='scipy')

# Using my own Weibull function as a check
def weibull(u,shape,scale):
    '''Weibull distribution for wind speed u with shape parameter k and scale parameter A'''
    return (shape / scale) * (u / scale)**(shape-1) * np.exp(-(u/scale)**shape)

plt.plot(center,weibull(center,shape,scale),label='Wind analysis',lw=2)
plt.legend()

Some extra info that helped me understand:

The Scipy Weibull function (exponweib) can take four input parameters: (a, c), loc and scale. You want to fix loc and the first shape parameter (a); this is done with floc=0, f0=1. Fitting will then give you the parameters c and scale, where c corresponds to the shape parameter of the two-parameter Weibull distribution (often used in wind data analysis) and scale corresponds to its scale factor.

From the docs:

exponweib.pdf(x, a, c) =
    a * c * (1-exp(-x**c))**(a-1) * exp(-x**c)*x**(c-1)

If a is 1, then

exponweib.pdf(x, a, c) =
    c * (1-exp(-x**c))**(0) * exp(-x**c)*x**(c-1)
  = c * (1) * exp(-x**c)*x**(c-1)
  = c * x **(c-1) * exp(-x**c)

From this, the relation to the 'wind analysis' Weibull function should be clearer.
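
The reduction above can also be checked numerically. The sketch below, with arbitrarily chosen (assumed) shape and scale values, compares the hand-written two-parameter Weibull density against exponweib with a=1 and against weibull_min.

```python
import numpy as np
from scipy import stats

def weibull_pdf(u, shape, scale):
    """Two-parameter Weibull density, written out by hand."""
    return (shape / scale) * (u / scale) ** (shape - 1) * np.exp(-(u / scale) ** shape)

u = np.linspace(0.1, 20.0, 50)
shape, scale = 1.8, 6.9  # illustrative values only

by_hand = weibull_pdf(u, shape, scale)
via_exponweib = stats.exponweib.pdf(u, 1, shape, loc=0, scale=scale)  # a = 1
via_weibull_min = stats.weibull_min.pdf(u, shape, loc=0, scale=scale)

print(np.max(np.abs(by_hand - via_exponweib)))
print(np.max(np.abs(by_hand - via_weibull_min)))
```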

I was curious about your question and, although this is not an answer, it compares the Matlab result with your result and with the result using leastsq, which showed the best correlation with the given data:

The code is as follows:

import scipy.stats as s
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as mtrand
from scipy.integrate import quad
from scipy.optimize import leastsq

## my distribution (two-parameter Weibull pdf with scale n and shape a)
def weib(x,n,a):
    return (a / n) * (x / n)**(a-1) * np.exp(-(x/n)**a)

def residuals(p,x,y):
    integral = quad( weib, 0, 16, args=(p[0],p[1]) )[0]
    penalization = abs(1.-integral)*100000
    return y - weib(x, p[0],p[1]) + penalization

#
data = np.loadtxt("stack_data.csv")


x = np.linspace(data.min(), data.max(), 100)
n, bins, patches = plt.hist(data,bins=x, density=True)
binsm = (bins[1:]+bins[:-1])/2

popt, pcov = leastsq(func=residuals, x0=(1.,1.), args=(binsm,n))

loc, scale = 1.58463497, 5.93030013
plt.plot(binsm,n)
plt.plot(x, weib(x, loc, scale),
         label='weib matlab, loc=%1.3f, scale=%1.3f' % (loc, scale), lw=4.)
loc, scale = s.exponweib.fit_loc_scale(data, 1, 1)
plt.plot(x, weib(x, loc, scale),
         label='weib stack, loc=%1.3f, scale=%1.3f' % (loc, scale), lw=4.)
plt.plot(x, weib(x,*popt),
         label='weib leastsq, loc=%1.3f, scale=%1.3f' % tuple(popt), lw=4.)

plt.legend(loc='upper right')
plt.show()

I had the same problem, but found that setting loc=0 in exponweib.fit primed the pump for the optimization. That was all that was needed from @user333700's answer. I couldn't load your data -- your data link points to an image, not data. So I ran a test on my data instead:

[Image: distribution fits to problematic (bimodal?) data]

import scipy.stats as ss
import matplotlib.pyplot as plt
import numpy as np

N=30
counts, bins = np.histogram(x, bins=N)
bin_width = bins[1]-bins[0]
total_count = float(sum(counts))

f, ax = plt.subplots(1, 1)
f.suptitle(query_uri)

ax.bar(bins[:-1]+bin_width/2., counts, align='center', width=.85*bin_width)
ax.grid('on')
def fit_pdf(x, name='lognorm', color='r'):
    dist = getattr(ss, name)  # params = shape, loc, scale
    # dist = ss.gamma  # 3 params

    params = dist.fit(x, loc=0)  # 1-day lag minimum for shipping
    y = dist.pdf(bins, *params)*total_count*bin_width
    sqerror_sum = np.log(sum(ci*(yi - ci)**2. for (ci, yi) in zip(counts, y)))
    ax.plot(bins, y, color, lw=3, alpha=0.6, label='%s   err=%3.2f' % (name, sqerror_sum))
    return y

colors = ['r-', 'g-', 'r:', 'g:']

for name, color in zip(['exponweib', 't', 'gamma'], colors): # 'lognorm', 'erlang', 'chi2', 'weibull_min', 
    y = fit_pdf(x, name=name, color=color)

ax.legend(loc='best', frameon=False)
plt.show()

There have been a few answers to this already, here and in other places, like in "Weibull distribution and the data in the same figure (with numpy and scipy)".

It still took me a while to come up with a clean toy example, so I thought it would be useful to post.

from scipy import stats
import matplotlib.pyplot as plt

#input for pseudo data
N = 10000
Kappa_in = 1.8
Lambda_in = 10
a_in = 1
loc_in = 0 

#Generate data from given input
data = stats.exponweib.rvs(a=a_in,c=Kappa_in, loc=loc_in, scale=Lambda_in, size = N)

#The a and loc are fixed in the fit since it is standard to assume they are known
a_out, Kappa_out, loc_out, Lambda_out = stats.exponweib.fit(data, f0=a_in,floc=loc_in)

#Plot
bins = range(51)
fig = plt.figure() 
ax = fig.add_subplot(1, 1, 1)
ax.plot(bins, stats.exponweib.pdf(bins, a=a_out,c=Kappa_out,loc=loc_out,scale = Lambda_out))
ax.hist(data, bins = bins , density=True, alpha=0.5)
ax.annotate("Shape: $k = %.2f$ \n Scale: $\\lambda = %.2f$"%(Kappa_out,Lambda_out), xy=(0.7, 0.85), xycoords=ax.transAxes)
plt.show()

The order of loc and scale is messed up in the code:

plt.plot(x, weib(x, scale, loc))

The scale parameter should come first.
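
A related observation about the question's code, offered as a hedged sketch on synthetic data (seed and parameters are assumptions): fit_loc_scale estimates a location and a scale by matching the first two moments while holding the shape arguments at the values passed in, so its outputs are not the Weibull shape and scale that the hand-written weib(x, n, a) expects. With both exponweib shapes set to 1 the reference distribution is a unit exponential (mean 1, variance 1), so the result reduces to "mean minus standard deviation" and "standard deviation".

```python
import numpy as np
from scipy import stats

# Synthetic two-parameter Weibull sample (assumed shape 1.8, scale 6.9).
sample = stats.weibull_min.rvs(1.8, loc=0, scale=6.9, size=5000, random_state=0)

# Moment matching with shapes held at a=1, c=1 (no shape is estimated).
loc_mm, scale_mm = stats.exponweib.fit_loc_scale(sample, 1, 1)
print("fit_loc_scale  :", loc_mm, scale_mm)
print("mean - std, std:", sample.mean() - sample.std(), sample.std())

# Maximum likelihood with the location fixed: returns shape and scale.
shape_hat, _, scale_hat = stats.weibull_min.fit(sample, floc=0)
print("MLE shape, scale:", shape_hat, scale_hat)
```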

In the meantime, there is a really good package out there: reliability. Here is the documentation: reliability @ readthedocs.

Your code simply becomes:

from reliability.Fitters import Fit_Weibull_2P
...
wb = Fit_Weibull_2P(failures=data)
plt.show()

Saves a lot of headaches and makes beautiful plots, too.
