statsmodels: What are the allowable formats to give to result.predict() for out-of-sample prediction using formula

I am trying to use statsmodels in Python to impute some values in a Pandas DataFrame.

The third and fourth attempts below (df2 and df3) give the error AttributeError: 'DataFrame' object has no attribute 'design_info'. This seems a strange error, since DataFrames never have such an attribute.

In any case, I do not understand what I should be passing to predict() in order to get a prediction for the missing value of A in df2. It would also be nice if the df3 case gave me a prediction that includes np.nan for the last element.

import pandas as pd
import numpy as np
import statsmodels.formula.api as sm

df0 = pd.DataFrame({"A": [10,20,30,324,2353,],
                    "B": [20, 30, 10, 100, 2332],
                    "C": [0, -30, 120, 11, 2]})

result0 = sm.ols(formula="A ~ B + C ", data=df0).fit()
print result0.summary()
test0 = result0.predict(df0) #works
print test0

df1 = pd.DataFrame({"A": [10,20,30,324,2353,],
                    "B": [20, 30, 10, 100, 2332],
                    "C": [0, -30, 120, 11, 2]})
result1 = sm.ols(formula="A ~ B+ I(C**2) ", data=df1).fit()
print result1.summary()
test1 = result1.predict(df1) #works
print test1


df2 = pd.DataFrame({"A": [10,20,30,324,2353,np.nan],
                    "B": [20, 30, 10, 100, 2332, 2332],
                    "C": [0, -30, 120, 11, 2, 2 ]})
result2 = sm.ols(formula="A ~ B + C", data=df2).fit()
print result2.summary()

test2 = result2.predict(df2)     # Fails
newvals=df2[['B','C']].dropna()
test2 = result2.predict(newvals)    # Fails
test2 = result2.predict(dict([[vv,df2[vv].values] for vv in newvals.columns]))     # Fails




df3 = pd.DataFrame({"A": [10,20,30,324,2353,2353],
                    "B": [20, 30, 10, 100, 2332, np.nan],
                    "C": [0, -30, 120, 11, 2, 2 ]})
result3 = sm.ols(formula="A ~ B + C", data=df3).fit()
print result3.summary()
test3 = result3.predict(df3)     # Fails

Update using pre-release statsmodels

Using the new release candidate for statsmodels 0.8, the df2 example above now works. However, the third (df3) example still fails on result3.predict(df3), this time with ValueError: Wrong number of items passed 5, placement implies 6.

Dropping the last row, which contains the np.nan, works; that is, result3.predict(df3[:-1]) predicts correctly for the rows for which prediction is possible.

It would still be nice to have an option to pass the entire df3 and receive np.nan as the prediction for the last row.
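
In the meantime, the following sketch gives roughly that behaviour (assuming statsmodels 0.8 and the df3/result3 defined above; complete, pred and pred_full are just illustrative names): predict only on the rows whose predictors are all present, then reindex the result onto df3's index so the dropped row comes back as np.nan.

complete = df3[['B', 'C']].dropna()        # rows whose predictors are all present
pred = pd.Series(np.asarray(result3.predict(complete)),
                 index=complete.index)     # works whether predict returns an array or a Series
pred_full = pred.reindex(df3.index)        # np.nan for the row that was dropped
print pred_full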

By way of answering this question, here is my resulting method for filling in some values in a DataFrame with an arbitrary (OLS) model. It drops the np.nan rows as needed before predicting.

#!/usr/bin/python
import statsmodels.formula.api as sm
import pandas as pd
import numpy as np

def df_impute_values_ols(adf,outvar,model,  verbose=True):
    """Specify a Pandas DataFrame with some null (eg. np.nan) values in column <outvar>.
    Specify a string model (in statsmodels format, which is like R) to use to predict them when they are missing. Nonlinear transformations can be specified in this string.

    e.g.: model='  x1 + np.sin(x1) + I((x1-5)**2) '

    At the moment, this uses OLS, so outvar should be continuous. 

    Two DataFrames are returned: one containing just the updated rows and a
    subset of columns, and a version of the incoming DataFrame with some
    null values filled in (those rows that have all of the model variables),
    using single imputation.

    This is written to work with statsmodels 0.6.1 (see https://github.com/statsmodels/statsmodels/issues/2171), i.e. it is written so as to avoid ANY NaNs in modeldf. That should be less necessary in future versions.

    To do: 
    - Add plots to  verbose mode 
    - Models other than OLS should be offered

    Issues:
    - the "horrid kluge" line below will give trouble if there are        
      column names that are part of other column names. This kludge should be 
      temporary, anyway, until statsmodels 0.8 is fixed and released. 

    The latest version of this method will be at 
     https://github.com/cpbl/cpblUtilities/ in stats/
    """
    formula = outvar + ' ~ ' + model
    # Horrid kluge: guess the right-hand-side variables by checking which
    # column names appear anywhere in the model string (see Issues above).
    rhsv = [vv for vv in adf.columns if vv in model]
    # Rows where outvar is missing but all of the model's predictors are present
    updateIndex = adf[pd.isnull(adf[outvar])][rhsv].dropna().index
    # Fully observed rows, used to estimate the model
    modeldf = adf[[outvar] + rhsv].dropna()
    results = sm.ols(formula, data=modeldf).fit()
    if verbose:
        print results.summary()
    # Predict outvar for the rows identified above and write the values back
    newvals = adf[pd.isnull(adf[outvar])][rhsv].dropna()
    newvals[outvar] = results.predict(newvals)
    adf.loc[updateIndex, outvar] = newvals[outvar]
    if verbose:
        print(' %d rows updated for %s' % (len(newvals), outvar))
    return newvals, adf


def test_df_impute_values_ols():
    # Find missing values and fill them in:
    df = pd.DataFrame({"A": [10, 20, 30, 324, 2353, np.nan],
                       "B": [20, 30, 10, 100, 2332, 2332],
                       "C": [0, np.nan, 120, 11, 2, 2 ]})
    newv,df2=df_impute_values_ols(df,'A',' B + C ',  verbose=True)
    print df2
    assert df2.iloc[-1]['A']==2357.5427562610648
    assert df2.size==18

    # Can we handle some missing values which also have missing predictors?
    df = pd.DataFrame({"A": [10, 20, 30,     324, 2353, np.nan, np.nan],
                       "B": [20, 30, 10,     100, 2332, 2332,   2332],
                       "C": [0, np.nan, 120, 11,   2,    2,     np.nan ]})
    newv,df2=df_impute_values_ols(df,'A',' B + C + I(C**2) ',  verbose=True)
    print df2

    assert pd.isnull(  df2.iloc[-1]['A'] )
    assert  df2.iloc[-2]['A'] == 2352.999999999995
