Fitting a Poisson distribution to data in statsmodels

I am trying to fit a Poisson distribution to my data using statsmodels, but I am confused by the results I am getting and by how to use the library.

My real data will be a series of numbers that I think I should be able to describe as having a Poisson distribution plus some outliers, so eventually I would like to do a robust fit to the data.

However, for testing purposes, I just create a dataset using scipy.stats.poisson:

samp = scipy.stats.poisson.rvs(4,size=200)

So to fit this using statsmodels, I think that I just need to have a constant 'endog':

res = sm.Poisson(samp,np.ones_like(samp)).fit()

print(res.summary())

                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  200
Model:                        Poisson   Df Residuals:                      199
Method:                           MLE   Df Model:                            0
Date:                Fri, 27 Jun 2014   Pseudo R-squ.:                   0.000
Time:                        14:28:29   Log-Likelihood:                -404.37
converged:                       True   LL-Null:                       -404.37
                                        LLR p-value:                       nan
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.3938      0.035     39.569      0.000         1.325     1.463
==============================================================================

OK, that doesn't look right. But if I do

res.predict()

I get an array of 4.03 (which was the mean for this test sample). So basically, firstly, I am very confused about how to interpret this result from statsmodels, and secondly, I should probably be doing something completely different if I'm interested in robust parameter estimation of a distribution rather than fitting trends, but how should I go about doing that?

Edit: I should really have given more detail in order to answer the second part of my question.

I have an event that occurs a random time after a starting time. When I plot a histogram of the delay times for many events, I see that the distribution looks like a scaled Poisson distribution plus several outlier points, which are normally caused by issues in my underlying system. So I simply wanted to find the expected time delay for the dataset, excluding the outliers. If not for the outliers, I could simply take the mean time. I suppose that I could exclude them manually, but I thought that I could find something more exacting.

Edit: On further reflection, I will be considering other distributions instead of sticking with a Poisson, and the details of my issue are probably a distraction from the original question, but I've left them here anyway.

The Poisson model, like most other models in the generalized linear model family or for other discrete data, assumes that we have a transformation that bounds the prediction in the appropriate range.

Poisson works for nonnegative numbers and the transformation is exp, so the estimated model assumes that the expected value of an observation, conditional on the explanatory variables, is

 E(y | x) = exp(X dot params)

To get the lambda parameter of the Poisson distribution, we need to use exp, i.e.

>>> np.exp(1.3938)
4.0301355071650118
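As a rough cross-check of the summary table, the constant-only Poisson fit can be reproduced by hand. The sketch below is not how statsmodels is implemented; it is a plain Newton iteration on the Poisson log-likelihood, and the function name `fit_constant_poisson` is hypothetical. It assumes the 200 observations summed to 806 (a mean of 4.03), which is inferred from the reported mean rather than stated in the question.

```python
import math


def fit_constant_poisson(total, n, tol=1e-12, max_iter=100):
    """Newton iteration for a Poisson model with only a constant.

    total: sum of the observed counts, n: number of observations.
    Returns (beta, std_err), where exp(beta) is the fitted lambda.
    """
    beta = 0.0
    for _ in range(max_iter):
        mu = math.exp(beta)
        score = total - n * mu   # first derivative of the log-likelihood
        info = n * mu            # negative second derivative
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    # Asymptotic standard error from the inverse information at the optimum
    std_err = 1.0 / math.sqrt(n * math.exp(beta))
    return beta, std_err


# Assumption: sample total of 806, i.e. mean 4.03 over 200 observations
beta, std_err = fit_constant_poisson(806, 200)
lam = math.exp(beta)      # estimated Poisson lambda, equal to the sample mean
z_value = beta / std_err  # comparable to the z column in the summary
print(round(beta, 4), round(std_err, 3), round(z_value, 3), round(lam, 2))
```

Under that assumption the coefficient, standard error and z statistic agree with the const row of the summary, and exp of the coefficient is just the sample mean.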

predict does this by default, but you can request just the linear part (X dot params) with a keyword argument.
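To make the two quantities concrete, here is a minimal pure-Python sketch of the linear part and the mean prediction. The helper `predict_poisson` and its `linear` flag are hypothetical names used only for illustration and are not the statsmodels API; the coefficient 1.3938 is taken from the summary above.

```python
import math


def predict_poisson(exog_rows, params, linear=False):
    """Hypothetical helper (not statsmodels): return X dot params,
    or exp(X dot params) when linear is False."""
    predictions = []
    for row in exog_rows:
        eta = sum(x * b for x, b in zip(row, params))  # linear predictor
        predictions.append(eta if linear else math.exp(eta))
    return predictions


exog = [[1.0]] * 200   # a constant column, as in np.ones_like(samp)
params = [1.3938]      # const coefficient from the summary table

means = predict_poisson(exog, params)                     # expected counts
linear_part = predict_poisson(exog, params, linear=True)  # log of the mean
print(means[0], linear_part[0])
```

Because every row of the design matrix is the same constant, every prediction is the same number, which is why the result is an array filled with the sample mean.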

BTW: statsmodels' controversial terminology: endog is y, exog is x (it has an x in it) (http://statsmodels.sourceforge.net/devel/endog_exog.html). In the call in the question, samp is the endog and the array of ones is the exog.

Outlier Robust Estimation

The answer to the last part of the question is that there is currently no outlier robust estimation in Python for Poisson or other count models, as far as I know.

For overdispersed data, where the variance is larger than the mean, we can use NegativeBinomial regression. For outliers in Poisson we would have to use R/Rpy or trim the outliers manually. Outlier identification could be based on one of the standardized residuals.
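Manual trimming based on standardized residuals could look roughly like the sketch below for the constant-only case. This is a heuristic and not a formal robust estimator: the function name `trimmed_poisson_mean`, the cutoff of 3, the median starting value and the made-up counts are all assumptions chosen for illustration. It uses the Pearson residual (y - mu) / sqrt(mu), drops points beyond the cutoff, and repeats until the kept set stops changing.

```python
import math
import statistics


def trimmed_poisson_mean(counts, threshold=3.0, max_iter=50):
    """Iteratively drop points whose Pearson residual exceeds threshold.

    Returns (mean of the kept points, list of kept points).
    """
    mu = statistics.median(counts)  # start from a value outliers barely move
    kept = list(counts)
    for _ in range(max_iter):
        scale = math.sqrt(mu) if mu > 0 else 1.0  # Poisson std dev is sqrt(mu)
        new_kept = [y for y in counts if abs(y - mu) / scale <= threshold]
        if not new_kept or new_kept == kept:
            break
        kept = new_kept
        mu = sum(kept) / len(kept)
    return sum(kept) / len(kept), kept


# Made-up example data: 20 well-behaved counts plus 3 obvious outliers
clean = [2, 3, 4, 4, 5, 3, 4, 6, 5, 4, 3, 4, 5, 2, 4, 6, 3, 5, 4, 4]
outliers = [25, 40, 31]
data = clean + outliers

raw_mean = sum(data) / len(data)
robust_mean, kept = trimmed_poisson_mean(data)
print(raw_mean, robust_mean, len(kept))
```

Trimming also removes legitimate tail observations when the cutoff is tight, so the estimate can be biased slightly; the cutoff should be chosen with that trade-off in mind.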

It will not be available in statsmodels for some time, unless someone contributes it.
