
Poisson Regression in statsmodels and R

Given some randomly generated data with

  • 2 columns,
  • 50 rows, and
  • integers in the range 0-100,

With R, the Poisson GLM and diagnostic plots can be produced like this:

> library(boot)  # provides glm.diag.plots
> col=2
> row=50
> range=0:100
> df <- data.frame(replicate(col,sample(range,row,rep=TRUE)))
> model <- glm(X2 ~ X1, data = df, family = poisson)
> glm.diag.plots(model)

In Python, this gives me the linear predictor vs. residual plot:

import numpy as np
import pandas as pd
import statsmodels.formula.api
from statsmodels.genmod.families import Poisson
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randint(100, size=(50,2)))
df.rename(columns={0:'X1', 1:'X2'}, inplace=True)
glm = statsmodels.formula.api.gee
model = glm("X2 ~ X1", groups=None, data=df, family=Poisson())
results = model.fit()

And to plot the diagnostics in Python:

model_fitted_y = results.fittedvalues  # fitted values (need a constant term for intercept)
model_residuals = results.resid # model residuals
model_abs_resid = np.abs(model_residuals)  # absolute residuals


plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)
plot_lm_1.axes[0] = sns.residplot(model_fitted_y, 'X2', data=df, lowess=True, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_xlabel('Line Predictor')
plot_lm_1.axes[0].set_ylabel('Residuals')
plt.show()

But when I try to get Cook's distance,

# cook's distance, from statsmodels internals
model_cooks = results.get_influence().cooks_distance[0]

it threw an error saying:

AttributeError                            Traceback (most recent call last)
<ipython-input-66-0f2bedfa1741> in <module>()
      4 model_residuals = results.resid
      5 # normalized residuals
----> 6 model_norm_residuals = results.get_influence().resid_studentized_internal
      7 # absolute squared normalized residuals
      8 model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

/opt/conda/lib/python3.6/site-packages/statsmodels/base/wrapper.py in __getattribute__(self, attr)
     33             pass
     34 
---> 35         obj = getattr(results, attr)
     36         data = results.model.data
     37         how = self._wrap_attrs.get(attr)

AttributeError: 'GEEResults' object has no attribute 'get_influence'

Is there a way to produce all 4 diagnostic plots in Python like in R?

How do I retrieve Cook's distance for the fitted model results in Python using statsmodels?

The generalized estimating equations (GEE) API should give you a different result than R's GLM model estimation. To get similar estimates in statsmodels, you need to use something like:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Read data generated in R using pandas or something similar
df = pd.read_csv(...) # file name goes here

# Add a column of ones for the intercept to create input X
X = np.column_stack( (np.ones((df.shape[0], 1)), df.X1) )

# Relabel dependent variable as y (standard notation)
y = df.X2

# Fit GLM in statsmodels using the Poisson family (log link by default)
sm.GLM(y, X, family = sm.families.Poisson()).fit().summary()

EDIT: Here is the rest of the answer on how to get Cook's distance in Poisson regression. This is a script I wrote based on some data generated in R. I compared my values against those calculated in R using the cooks.distance function, and the values matched.

from __future__ import division, print_function

import numpy as np
import pandas as pd
import statsmodels.api as sm

PATH = '/Users/robertmilletich/test_reg.csv'


def _weight_matrix(fitted_model):
    """Calculates weight matrix in Poisson regression

    Parameters
    ----------
    fitted_model : statsmodels results object
        Fitted Poisson model

    Returns
    -------
    W : 2d array-like
        Diagonal weight matrix in Poisson regression
    """
    return np.diag(fitted_model.fittedvalues)


def _hessian(X, W):
    """Hessian matrix calculated as -X'*W*X

    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates

    W : 2d array-like
        Weight matrix

    Returns
    -------
    hessian : 2d array-like
        Hessian matrix
    """
    return -np.dot(X.T, np.dot(W, X))


def _hat_matrix(X, W):
    """Calculate hat matrix = W^(1/2) * X * (X'*W*X)^(-1) * X'*W^(1/2)

    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates

    W : 2d array-like
        Diagonal weight matrix

    Returns
    -------
    hat : 2d array-like
        Hat matrix
    """
    # W^(1/2)
    Wsqrt = W**(0.5)

    # (X'*W*X)^(-1)
    XtWX     = -_hessian(X = X, W = W)
    XtWX_inv = np.linalg.inv(XtWX)

    # W^(1/2)*X
    WsqrtX = np.dot(Wsqrt, X)

    # X'*W^(1/2)
    XtWsqrt = np.dot(X.T, Wsqrt)

    return np.dot(WsqrtX, np.dot(XtWX_inv, XtWsqrt))


def main():

    # Load data and separate into X and y
    df = pd.read_csv(PATH)
    X  = np.column_stack( (np.ones((df.shape[0], 1)), df.X1 ) )
    y  = df.X2

    # Fit model
    model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

    # Weight matrix
    W = _weight_matrix(model)

    # Hat matrix
    H   = _hat_matrix(X, W)
    hii = np.diag(H) # Diagonal values of hat matrix

    # Pearson residuals
    r = model.resid_pearson

    # Cook's distance (formula used by R = (res/(1 - hat))^2 * hat/(dispersion * p))
    # Note: dispersion is 1 since we aren't modeling overdispersion, and
    # p = 2 is the number of estimated parameters (intercept and slope)
    cooks_d = (r/(1 - hii))**2 * hii/(1*2)
    print(cooks_d)

if __name__ == "__main__":
    main()

As an update here:

Since version 0.10, statsmodels also provides a get_influence method for GLMResults.

https://www.statsmodels.org/dev/examples/notebooks/generated/influence_glm_logit.html

For example:

Print influence and outlier measures for the 10 observations with the largest Cook's distance:

infl = res.get_influence(observed=False)
summ_df = infl.summary_frame()
summ_df.sort_values("cooks_d", ascending=False)[:10]

There are no combination plots, but an influence plot, infl.plot_influence(), and an index plot, infl.plot_index(...), are available for any of the measures.

Generic influence measures for maximum likelihood models are, or will become, available for discrete and other models.

MLE influence measures are based on the Hessian, i.e. the observed information matrix, while for GLM both the expected information matrix and Hessian versions are available. In GLM, the distinction is only relevant when non-canonical links are used.

