Understanding statsmodels linear regression

I am trying to fit a linear regression model using the statsmodels library.

I have a question about the fit() method. Say I have a data sample of size 15, split it into 3 parts, and fit the model on each part. Does each call to fit() fit the model properly, or does it overwrite the previous values?

import numpy as np
import statsmodels.api as sm

# First call
X = [377, 295, 457, 495, 9] # independent variable
y = [23, 79, 16, 41, 40]    # dependent variable
X = sm.add_constant(X)
ols = sm.OLS(y,X).fit()
#print(ols.summary())

# Second call
X = [243, 493, 106, 227, 334]
y = [3, 5, 1, 62, 92]
X = sm.add_constant(X)
ols = sm.OLS(y,X).fit()
#print(ols.summary())

# Third call
X = [412, 332, 429, 96, 336] 
y = [30, 1, 99, 4, 33]
X = sm.add_constant(X)
ols = sm.OLS(y,X).fit()
#print(ols.summary())

scores = [9, 219, 200, 134, 499]
scores = sm.add_constant(scores)
print(ols.predict(scores))

Each call to sm.OLS(y, X) creates a new model instance, and each call to .fit() creates a new results instance that holds a reference to the underlying model. The instances are independent of each other: they share no attributes, except possibly the underlying data.

However, in your example you assign the same name, ols, to each of the regression results, so the name ols refers only to the last instance. The final predict() call therefore uses only the coefficients from the third fit.

More details:

Creating a model with sm.OLS(y, X) does not copy the data y and X unless a copy is needed. Specifically, if y and X are numpy ndarrays, no copy is needed. (Technically, the conversion and copy behavior depends on np.asarray(y) and np.asarray(X).)

Repeated calls to the fit method create a new results instance each time, but all of them hold a reference to the same model instance. For example, we can call fit with different cov_type options, which computes the covariance of the parameter estimates under different assumptions.

model = sm.OLS(y,X)
ols_nonrobust = model.fit()
ols_hc = model.fit(cov_type="HC3")

In most models, all the relevant information from the fit is attached to the results instance. In the case above we can look at both results instances at the same time, e.g. to compare the parameter standard errors:

ols_nonrobust.bse
ols_hc.bse

statsmodels still has a few cases, in RLM and some time series models, where some fit options might change the underlying model. In those cases, only the last results instance created by fit will have the correct model attributes. This is fine if we fit in a loop and only need the last instance, but it might show incorrect results if several results instances are used at the same time and they refer to the same underlying model instance. See http://www.statsmodels.org/devel/pitfalls.html#repeated-calls-to-fit-with-different-parameters
