How to use scale and shape parameters of gamma GLM in statsmodels

The task

I have data that looks like this:

[Image: the data]

I want to fit a generalized linear model (GLM) from the gamma family to this data using statsmodels. Using this model, for each of my observations I want to calculate the probability of observing a value smaller than (or equal to) that observation. In other words, I want to calculate:

P(y <= y_i | x_i)

My questions

  • How do I get the shape and scale parameters from the fitted GLM in statsmodels? According to this question, the scale parameter in statsmodels is not parameterized in the usual way. Can I use it directly as input to a gamma distribution in scipy, or do I need a transformation first?

  • How do I use these parameters (shape and scale) to get the probabilities? Currently I'm using scipy to generate a distribution for each x_i and get the probability from that. See the implementation below.

My current implementation

import numpy as np
import scipy.stats as stat
import patsy
import statsmodels.api as sm

# Generate data in correct form
y, X = patsy.dmatrices('y ~ x', data=myData, return_type='dataframe')

# Fit model with gamma family and log link
mod = sm.GLM(y, X, family=sm.families.Gamma(sm.families.links.log())).fit()

# Predict mean
myData['mu'] = mod.predict(exog=X) 

# Predict probabilities (note that for a gamma distribution mean = shape * scale)
probabilities = np.array(
    [stat.gamma(m_i/mod.scale, scale=mod.scale).cdf(y_i) for m_i, y_i in zip(myData['mu'], myData['y'])]
)

However, when I perform this procedure I get the following result:

[Image: the data, colored by predicted probability]

Currently the predicted probabilities all seem really high. The red line in the graph is the predicted mean. But even for points below this line the predicted cumulative probability is around 80%. This makes me wonder whether the scale parameter I used is indeed the correct one.

In R, you can obtain an estimate of the shape using 1/dispersion (check this post). Unfortunately, statsmodels names its dispersion estimate scale, so you need to take the reciprocal of it to get the shape estimate. I show this with an example below:
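The reciprocal follows from matching moments. As a sketch, assume the Gamma family's variance function V(μ) = μ², and write φ for the dispersion that statsmodels reports as scale, and k and θ_i for the shape and scale in scipy's parameterization (these symbols are introduced here for illustration):

```latex
\begin{aligned}
\mathbb{E}[y_i] &= k\,\theta_i = \mu_i, \qquad \operatorname{Var}[y_i] = k\,\theta_i^2 = \phi\,\mu_i^2 \\
\Rightarrow\quad k &= \frac{1}{\phi}, \qquad \theta_i = \phi\,\mu_i = \frac{\mu_i}{k} \\
P(y \le y_i \mid x_i) &= F_{\mathrm{Gamma}}\!\left(y_i;\ k = 1/\phi,\ \theta_i = \phi\,\mu_i\right)
\end{aligned}
```

So the shape is shared by all observations, while the scipy scale changes with the predicted mean.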

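import numpy as np
import statsmodels.api as sm
from scipy.stats import gamma

# Simulate gamma data with shape 2 and scale 5, then fit an intercept-only model
values = gamma.rvs(2,scale=5,size=500)
fit = sm.GLM(values, np.repeat(1,500), family=sm.families.Gamma(sm.families.links.log())).fit()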

This is an intercept only model, and we check the intercept and dispersion (named scale):

[fit.params,fit.scale]
[array([2.27875973]), 0.563667465203953]

So the mean is exp(2.2788) ≈ 9.76, close to the simulated mean of 2 × 5 = 10. If we take the shape as 1/dispersion, shape = 1/0.563667465203953 = 1.774096, which is close to the shape of 2 that we simulated.
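To see why the parameterization matters, here is a minimal sketch using only scipy. The helper name `gamma_from_glm` and the values of `mu` and `dispersion` are made up for illustration. It compares the distribution built with shape = 1/dispersion against the one built as in the question, with shape = mu/dispersion and scale = dispersion.

```python
from scipy.stats import gamma


def gamma_from_glm(mu, dispersion):
    """Hypothetical helper: frozen gamma with mean mu and variance dispersion * mu**2."""
    shape = 1.0 / dispersion
    return gamma(a=shape, scale=mu * dispersion)


mu, dispersion = 10.0, 0.5  # illustrative values, equivalent to shape 2 and scale 5

# Parameterization from the answer: shape = 1 / dispersion, scale = mu * dispersion
good = gamma_from_glm(mu, dispersion)
good_mean, good_var = (float(v) for v in good.stats(moments="mv"))

# Parameterization from the question: shape = mu / dispersion, scale = dispersion
bad = gamma(a=mu / dispersion, scale=dispersion)
bad_mean, bad_var = (float(v) for v in bad.stats(moments="mv"))

print("answer:  ", good_mean, good_var)
print("question:", bad_mean, bad_var)
```

Both versions reproduce the mean, but only the first has variance dispersion * mu**2. The second has variance mu * dispersion, so its CDF values can be far off even though the mean looks right.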

If I use a simulated dataset, it works perfectly fine. This is what it looks like, with a shape of 10:

from scipy.stats import gamma
import numpy as np
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm
import pandas as pd

_shape = 10
myData = pd.DataFrame({'x':np.random.uniform(0,10,size=500)})
myData['y'] = gamma.rvs(_shape,scale=np.exp(-myData['x']/3 + 0.5)/_shape,size=500)

myData.plot("x","y",kind="scatter")

[Image: scatter plot of y against x]

Then we fit the model like you did:

y, X = patsy.dmatrices('y ~ x', data=myData, return_type='dataframe')
mod = sm.GLM(y, X, family=sm.families.Gamma(sm.families.links.log())).fit()
mu = mod.predict(exog=X) 

shape_from_model = 1/mod.scale

probabilities = [gamma(shape_from_model, scale=m_i/shape_from_model).cdf(y_i) for m_i, y_i in zip(mu,myData['y'])]

And plot:

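fig, ax = plt.subplots()
# Color each observation by its predicted cumulative probability
im = ax.scatter(myData["x"],myData["y"],c=probabilities)
# Overlay the predicted mean in red without reassigning im, so the colorbar reflects the probabilities
ax.scatter(myData['x'],mu,c="r",s=1)
fig.colorbar(im, ax=ax)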

[Image: the simulated data colored by predicted probability, with the predicted mean in red]
