
How to predict new values using statsmodels.formula.api (python)

I trained a logistic model on the breast cancer data using ONLY one feature, 'mean_area':

from statsmodels.formula.api import logit
logistic_model = logit('target ~ mean_area',breast)
result = logistic_model.fit()

The fitted result has a built-in predict method. However, called without arguments it returns the predicted values for all the training samples:

predictions = result.predict()

Suppose I want the prediction for a new value, say 30. How do I use the trained model to output the prediction (rather than reading the coefficients and computing it manually)?

You can pass new values to .predict(), as illustrated for a single observation in output #11 of this notebook from the docs. Multiple observations can be passed as a 2D array, for instance a DataFrame (see the docs).

Since you are using the formula API, your input needs to be a pd.DataFrame so that the column references are available. In your case, you could use something like .predict(pd.DataFrame({'mean_area': [1, 2, 3]})).
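As a minimal sketch of the approach above: the data here is synthetic (an assumption, since the original `breast` DataFrame is not shown), but the column names match the question's formula. The point is that the new DataFrame only needs the predictor column named in the formula.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic stand-in for the breast cancer data (assumption: not the real dataset).
rng = np.random.default_rng(0)
mean_area = rng.uniform(100, 2000, size=300)
# Larger areas are made less likely to have target == 1.
p = 1 / (1 + np.exp(-(4.0 - 0.006 * mean_area)))
breast = pd.DataFrame({"mean_area": mean_area, "target": rng.binomial(1, p)})

result = logit("target ~ mean_area", breast).fit(disp=0)

# New observations must use the same column name as the formula.
new_data = pd.DataFrame({"mean_area": [30, 500, 1500]})
new_probs = result.predict(new_data)  # predicted P(target == 1) per row
print(new_probs)
```

For a logit model the returned values are probabilities, so each one lies between 0 and 1.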

By default, when no alternative data is provided, statsmodels' .predict() uses only the observations that were used for fitting.

import statsmodels.formula.api as smf


model = smf.ols('y ~ x', data=df).fit()

# Predict for a list of observations; the list can hold one or many values.
prediction = model.get_prediction(exog=dict(x=[5, 10, 25]))
# Table of predicted means with 95% confidence and prediction intervals.
prediction.summary_frame(alpha=0.05)
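The snippet above assumes a DataFrame `df` already exists. A self-contained sketch, using made-up data (an assumption) and a DataFrame for the new values, could look like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up linear data (assumption): y = 2 + 0.5 * x + noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 30, 50)
df = pd.DataFrame({"x": x, "y": 2.0 + 0.5 * x + rng.normal(0, 1, size=50)})

model = smf.ols("y ~ x", data=df).fit()

# New values to predict for; the column name matches the formula.
new_x = pd.DataFrame({"x": [5, 10, 25]})
frame = model.get_prediction(exog=new_x).summary_frame(alpha=0.05)
print(frame)
```

In the resulting frame, the `mean_ci_*` columns bound the expected value of y, while the `obs_ci_*` columns bound a single new observation and are therefore wider.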

I had difficulty predicting values from a fresh pandas DataFrame, so after fitting I appended the rows to be predicted to the original dataset:

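   import pandas as pd
   import statsmodels.api as sm

   # `data` is a DataFrame with columns 'price', 'size' and 'year'
   y = data['price']
   x1 = data[['size', 'year']]
   data.columns
   #Index(['price', 'size', 'year'], dtype='object')
   x = sm.add_constant(x1)
   results = sm.OLS(y, x).fit()
   results.summary()
   ## predict on unknown data
   # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
   new_rows = pd.DataFrame({'size': [853.0, 777], 'year': [2012.0, 2013], 'price': [None, None]})
   data = pd.concat([data, new_rows], ignore_index=True)
   data.tail()
   new_x = data.loc[data.price.isnull(), ['size', 'year']]
   # has_constant='add' forces the constant column even if new_x has a single row
   results.predict(sm.add_constant(new_x, has_constant='add'))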

This is already answered but I hope this will help.

According to the documentation, the first parameter is "exog".

exog : array_like, optional The values for which you want to predict

Further it says,

"If a formula was used, then exog is processed in the same way as the original data. This transformation needs to have key access to the same variable names, and can be a pandas DataFrame or a dict like object that contains numpy arrays.

If no formula was used, then the provided exog needs to have the same number of columns as the original exog in the model. No transformation of the data is performed except converting it to a numpy array.

Row indices as in pandas data frames are supported, and added to the returned prediction"

from statsmodels.formula.api import logit

logistic_model = logit('target ~ mean_area',breast)
result = logistic_model.fit()

Therefore, you can pass a pandas DataFrame (e.g. df) as the exog parameter. The DataFrame must contain mean_area as a column, because 'mean_area' is the predictor (the independent variable).

# Call predict on the fitted result; the unfitted model's predict also requires the parameters.
predictions = result.predict(exog=df)
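As a sanity check, the value returned by `predict` should agree with the manual calculation the question wanted to avoid, namely the logistic function applied to intercept + coefficient * mean_area. The sketch below again uses synthetic data (an assumption) and a labelled index to show that the row labels carry over to the predictions.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

# Synthetic stand-in for the breast cancer data (assumption).
rng = np.random.default_rng(42)
mean_area = rng.uniform(100, 2000, size=400)
prob = 1 / (1 + np.exp(-(4.0 - 0.006 * mean_area)))
breast = pd.DataFrame({"mean_area": mean_area, "target": rng.binomial(1, prob)})

result = logit("target ~ mean_area", breast).fit(disp=0)

# New data with a custom index; the labels are illustrative only.
df = pd.DataFrame({"mean_area": [30.0, 800.0]}, index=["small", "large"])
predictions = result.predict(exog=df)

# Manual computation from the fitted coefficients.
b0 = result.params["Intercept"]
b1 = result.params["mean_area"]
manual = 1 / (1 + np.exp(-(b0 + b1 * df["mean_area"])))

print(predictions)
print(manual)
```

Both printouts should show the same probabilities, indexed by the labels of df.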
