简体   繁体   中英

Poisson Regression in statsmodels and R

Given the some randomly generated data with

  • 2 columns,
  • 50 rows and
  • integer range between 0-100

With R , the poisson glm and diagnostics plot can be achieved as such:

> col=2
> row=50
> range=0:100
> df <- data.frame(replicate(col,sample(range,row,rep=TRUE)))
> model <- glm(X2 ~ X1, data = df, family = poisson)
> glm.diag.plots(model)

In Python , this would give me the line predictor vs residual plot :

import numpy as np
import pandas as pd
import statsmodels.formula.api
from statsmodels.genmod.families import Poisson
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randint(100, size=(50,2)))
df.rename(columns={0:'X1', 1:'X2'}, inplace=True)
glm = statsmodels.formula.api.gee
model = glm("X2 ~ X1", groups=None, data=df, family=Poisson())
results = model.fit()

And to plot the diagnostics in Python:

model_fitted_y = results.fittedvalues  # fitted values (need a constant term for intercept)
model_residuals = results.resid # model residuals
model_abs_resid = np.abs(model_residuals)  # absolute residuals


plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)
plot_lm_1.axes[0] = sns.residplot(model_fitted_y, 'X2', data=df, lowess=True, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_xlabel('Line Predictor')
plot_lm_1.axes[0].set_ylabel('Residuals')
plt.show()

But when I try to get the cook statistics,

# cook's distance, from statsmodels internals
model_cooks = results.get_influence().cooks_distance[0]

it threw an error saying:

AttributeError                            Traceback (most recent call last)
<ipython-input-66-0f2bedfa1741> in <module>()
      4 model_residuals = results.resid
      5 # normalized residuals
----> 6 model_norm_residuals = results.get_influence().resid_studentized_internal
      7 # absolute squared normalized residuals
      8 model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

/opt/conda/lib/python3.6/site-packages/statsmodels/base/wrapper.py in __getattribute__(self, attr)
     33             pass
     34 
---> 35         obj = getattr(results, attr)
     36         data = results.model.data
     37         how = self._wrap_attrs.get(attr)

AttributeError: 'GEEResults' object has no attribute 'get_influence'

Is there a way to plot out all 4 diagnostic plots in Python like in R?

How do I retrieve the cook statistics of the fitted model results in Python using statsmodels ?

The generalized estimating equations API should give you a different result than R's GLM model estimation. To get similar estimates in statsmodels, you need to use something like:

import pandas as pd
import statsmodels.api as sm

# Read data generated in R using pandas or something similar
df = pd.read_csv(...) # file name goes here

# Add a column of ones for the intercept to create input X
X = np.column_stack( (np.ones((df.shape[0], 1)), df.X1) )

# Relabel dependent variable as y (standard notation)
y = df.X2

# Fit GLM in statsmodels using Poisson link function
sm.GLM(y, X, family = Poisson()).fit().summary()

EDIT -- Here is the rest of the answer on how to get Cook's distance in Poisson regression. This is a script I wrote based on some data generated in R. I compared my values against those in R calculated using the cooks.distance function and the values matched.

from __future__ import division, print_function

import numpy as np
import pandas as pd
import statsmodels.api as sm

PATH = '/Users/robertmilletich/test_reg.csv'


def _weight_matrix(fitted_model):
    """Calculates weight matrix in Poisson regression

    Parameters
    ----------
    fitted_model : statsmodel object
        Fitted Poisson model

    Returns
    -------
    W : 2d array-like
        Diagonal weight matrix in Poisson regression
    """
    return np.diag(fitted_model.fittedvalues)


def _hessian(X, W):
    """Hessian matrix calculated as -X'*W*X

    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates

    W : 2d array-like
        Weight matrix

    Returns
    -------
    hessian : 2d array-like
        Hessian matrix
    """
    return -np.dot(X.T, np.dot(W, X))


def _hat_matrix(X, W):
    """Calculate hat matrix = W^(1/2) * X * (X'*W*X)^(-1) * X'*W^(1/2)

    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates

    W : 2d array-like
        Diagonal weight matrix

    Returns
    -------
    hat : 2d array-like
        Hat matrix
    """
    # W^(1/2)
    Wsqrt = W**(0.5)

    # (X'*W*X)^(-1)
    XtWX     = -_hessian(X = X, W = W)
    XtWX_inv = np.linalg.inv(XtWX)

    # W^(1/2)*X
    WsqrtX = np.dot(Wsqrt, X)

    # X'*W^(1/2)
    XtWsqrt = np.dot(X.T, Wsqrt)

    return np.dot(WsqrtX, np.dot(XtWX_inv, XtWsqrt))


def main():

    # Load data and separate into X and y
    df = pd.read_csv(PATH)
    X  = np.column_stack( (np.ones((df.shape[0], 1)), df.X1 ) )
    y  = df.X2

    # Fit model
    model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

    # Weight matrix
    W = _weight_matrix(model)

    # Hat matrix
    H   = _hat_matrix(X, W)
    hii = np.diag(H) # Diagonal values of hat matrix

    # Pearson residuals
    r = model.resid_pearson

    # Cook's distance (formula used by R = (res/(1 - hat))^2 * hat/(dispersion * p))
    # Note: dispersion is 1 since we aren't modeling overdispersion
    cooks_d = (r/(1 - hii))**2 * hii/(1*2)

if __name__ == "__main__":
    main()

As an update here

statsmodels has now, since version 0.10, get_influence method also for GLMResults.

https://www.statsmodels.org/dev/examples/notebooks/generated/influence_glm_logit.html

for example:

Print influence and outlier measures for 10 observations with largest cook distance:

infl = res.get_influence(observed=False)
summ_df = infl.summary_frame()
summ_df.sort_values("cooks_d", ascending=False)[:10]

There are no combination plots, but influence plot infl.plot_influence() and index plot infl.plot_index(...) for any of the measures are available.

Generic influence measures for maximum likelihood models is or will become available discrete and other models.

MLE influence measures are based on hessian, ie observed information matrix, while for GLM both expected information matrix and hessian versions are available. In GLM, the distinction is only relevant when non-canonical links are used.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM