Poisson regression in statsmodels and R

Question

Given some randomly generated data with

2 columns,
50 rows, and
integers in the range 0-100

In R, the Poisson GLM and its diagnostic plots can be produced like this:

> col=2
> row=50
> range=0:100
> df <- data.frame(replicate(col,sample(range,row,rep=TRUE)))
> model <- glm(X2 ~ X1, data = df, family = poisson)
> library(boot)  # glm.diag.plots is provided by the boot package
> glm.diag.plots(model)

In Python, this gives me the linear predictor vs. residuals plot:

import numpy as np
import pandas as pd
import statsmodels.formula.api
from statsmodels.genmod.families import Poisson
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randint(100, size=(50,2)))
df.rename(columns={0:'X1', 1:'X2'}, inplace=True)
glm = statsmodels.formula.api.gee
model = glm("X2 ~ X1", groups=None, data=df, family=Poisson())
results = model.fit()

and plots the diagnostics in Python:

model_fitted_y = results.fittedvalues  # fitted values (need a constant term for intercept)
model_residuals = results.resid # model residuals
model_abs_resid = np.abs(model_residuals)  # absolute residuals


plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)
plot_lm_1.axes[0] = sns.residplot(model_fitted_y, 'X2', data=df, lowess=True, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_xlabel('Line Predictor')
plot_lm_1.axes[0].set_ylabel('Residuals')
plt.show()

But when I try to get Cook's distance

# cook's distance, from statsmodels internals
model_cooks = results.get_influence().cooks_distance[0]

it throws an error saying:

AttributeError                            Traceback (most recent call last)
<ipython-input-66-0f2bedfa1741> in <module>()
      4 model_residuals = results.resid
      5 # normalized residuals
----> 6 model_norm_residuals = results.get_influence().resid_studentized_internal
      7 # absolute squared normalized residuals
      8 model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

/opt/conda/lib/python3.6/site-packages/statsmodels/base/wrapper.py in __getattribute__(self, attr)
     33             pass
     34 
---> 35         obj = getattr(results, attr)
     36         data = results.model.data
     37         how = self._wrap_attrs.get(attr)

AttributeError: 'GEEResults' object has no attribute 'get_influence'

Is there a way to produce all 4 diagnostic plots in Python, as in R?

How can I retrieve Cook's distance for a fitted model's results in Python using statsmodels?

Answer 1

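The generalized estimating equations (GEE) API should give results that differ from R's GLM estimates. To get comparable estimates in statsmodels, you need to use something like the following: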

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Read data generated in R using pandas or something similar
df = pd.read_csv(...) # file name goes here

# Add a column of ones for the intercept to create input X
X = np.column_stack( (np.ones((df.shape[0], 1)), df.X1) )

# Relabel dependent variable as y (standard notation)
y = df.X2

# Fit GLM in statsmodels using the Poisson family (log link by default)
sm.GLM(y, X, family = sm.families.Poisson()).fit().summary()

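Edit: here is the rest of the answer, on how to get Cook's distance in Poisson regression. This is a script I wrote based on some data generated in R. I compared my values with those computed by R's cooks.distance function, and they matched.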

from __future__ import division, print_function

import numpy as np
import pandas as pd
import statsmodels.api as sm

PATH = '/Users/robertmilletich/test_reg.csv'


def _weight_matrix(fitted_model):
    """Calculates weight matrix in Poisson regression

    Parameters
    ----------
    fitted_model : statsmodel object
        Fitted Poisson model

    Returns
    -------
    W : 2d array-like
        Diagonal weight matrix in Poisson regression
    """
    return np.diag(fitted_model.fittedvalues)


def _hessian(X, W):
    """Hessian matrix calculated as -X'*W*X

    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates

    W : 2d array-like
        Weight matrix

    Returns
    -------
    hessian : 2d array-like
        Hessian matrix
    """
    return -np.dot(X.T, np.dot(W, X))


def _hat_matrix(X, W):
    """Calculate hat matrix = W^(1/2) * X * (X'*W*X)^(-1) * X'*W^(1/2)

    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates

    W : 2d array-like
        Diagonal weight matrix

    Returns
    -------
    hat : 2d array-like
        Hat matrix
    """
    # W^(1/2)
    Wsqrt = W**(0.5)

    # (X'*W*X)^(-1)
    XtWX     = -_hessian(X = X, W = W)
    XtWX_inv = np.linalg.inv(XtWX)

    # W^(1/2)*X
    WsqrtX = np.dot(Wsqrt, X)

    # X'*W^(1/2)
    XtWsqrt = np.dot(X.T, Wsqrt)

    return np.dot(WsqrtX, np.dot(XtWX_inv, XtWsqrt))


def main():

    # Load data and separate into X and y
    df = pd.read_csv(PATH)
    X  = np.column_stack( (np.ones((df.shape[0], 1)), df.X1 ) )
    y  = df.X2

    # Fit model
    model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

    # Weight matrix
    W = _weight_matrix(model)

    # Hat matrix
    H   = _hat_matrix(X, W)
    hii = np.diag(H) # Diagonal values of hat matrix

    # Pearson residuals
    r = model.resid_pearson

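    # Cook's distance (formula used by R = (res/(1 - hat))^2 * hat/(dispersion * p))
    # Note: dispersion is 1 since we aren't modeling overdispersion
    p = X.shape[1]  # number of estimated parameters (intercept + slope = 2)
    cooks_d = (r/(1 - hii))**2 * hii/(1*p)
    print(cooks_d)
    return cooks_d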

if __name__ == "__main__":
    main()
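The script above builds the full n-by-n hat matrix only to read off its diagonal. As a sketch of an alternative, the diagonal can be computed directly from h_ii = w_i * x_i' (X'WX)^(-1) x_i. The helper names leverage and cooks_distance are hypothetical, and the weights below are arbitrary positive numbers standing in for fitted Poisson means, not output from a real fit.

```python
import numpy as np


def leverage(X, w):
    """Diagonal of W^(1/2) X (X'WX)^(-1) X' W^(1/2) for diagonal weights w."""
    XtWX_inv = np.linalg.inv(X.T @ (w[:, None] * X))
    # h_ii = w_i * x_i' (X'WX)^(-1) x_i, evaluated for every row at once
    return w * np.einsum("ij,jk,ik->i", X, XtWX_inv, X)


def cooks_distance(resid_pearson, hii, n_params, dispersion=1.0):
    """(res / (1 - hat))^2 * hat / (dispersion * p), the formula quoted in the script."""
    return (resid_pearson / (1 - hii)) ** 2 * hii / (dispersion * n_params)


# Illustrative inputs only (assumption): random covariate and stand-in weights.
rng = np.random.default_rng(0)
X = np.column_stack((np.ones(50), rng.integers(0, 101, size=50)))
w = rng.uniform(10.0, 60.0, size=50)

hii = leverage(X, w)
# The trace of the hat matrix should be close to the number of columns of X.
print(hii.sum())
```

In a real analysis, w would be the fitted values of the Poisson model and resid_pearson its Pearson residuals, as in main() above.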

Answer 2

As an update here:

As of version 0.10, statsmodels also provides a get_influence method for GLMResults.

https://www.statsmodels.org/dev/examples/notebooks/generated/influence_glm_logit.html

For example:

Print the influence and outlier measures for the 10 observations with the largest Cook's distance:

infl = res.get_influence(observed=False)
summ_df = infl.summary_frame()
summ_df.sort_values("cooks_d", ascending=False)[:10]

There is no combined plot, but the influence plot infl.plot_influence() and the index plot infl.plot_index(...) are available.

Generic influence measures for maximum likelihood models are, or will become, available for discrete models and other models.

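The MLE influence measures are based on the Hessian, i.e. the observed information matrix, while for GLM both the expected information matrix and the Hessian versions are available. In GLM, the distinction is relevant only when a non-canonical link is used.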

Answer 1
11 2018-01-02 21:31:00

Answer 2
0 2021-10-18 00:31:15
