I am having problems running OLS in Python after reading in Stata data. Below are my code and the error message:
import pandas as pd # To read data
import numpy as np
import statsmodels.api as sm
gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
X = gss[['age', 'impinc' ]]
y = gss[['educ']]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())
The error message says:
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
Any thoughts on how to run this simple OLS?
Your age variable contains a value "89 or older", which is causing it to be read as a string, which is not a valid input for statsmodels. You have to deal with this so it can be read as integer or float, for example like this:
gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
gss = gss[gss.age != '89 or older']
gss['age'] = gss.age.astype(float)
X = gss[['age', 'impinc' ]]
y = gss[['educ']]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())
PS: I'm not saying that dropping observations where age == "89 or older" is the best way. You'll have to decide how best to deal with this. If you want to have a categorical variable in your model, you'll have to create dummies first.
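As a rough illustration of the "create dummies first" step, here is a minimal sketch using pd.get_dummies. The small DataFrame, the age_group column, and its category labels are invented for the example and are not taken from gssSample.dta.

```python
import pandas as pd

# Hypothetical stand-in data; column names and values are assumptions.
df = pd.DataFrame({
    "impinc": [20000.0, 45000.0, 30000.0, 52000.0],
    "age_group": ["18-39", "40-64", "65+", "40-64"],
})

# One 0/1 column per category. drop_first=True leaves out the first level
# as the baseline so the dummies are not collinear with the constant.
dummies = pd.get_dummies(df["age_group"], prefix="age", drop_first=True, dtype=float)

# Combine the numeric predictor with the dummy columns.
X = pd.concat([df[["impinc"]], dummies], axis=1)
print(X.columns.tolist())
```

The resulting X is fully numeric, so it can be passed to sm.add_constant and sm.OLS in the same way as above.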
EDIT: If your .dta file contains a numeric variable with value labels, the value labels will be used as values by default, causing it to be read as string. You can use convert_categoricals=False with pd.read_stata to read in the numeric values.
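To see the effect of convert_categoricals, the sketch below writes a tiny, made-up Stata file in which the value 89 carries the label "89 or older", then reads it back both ways. The data and the file name example.dta are assumptions for illustration, and the value_labels argument of to_stata needs pandas 1.4 or later.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: age is stored as a number, and 89 gets a value label.
df = pd.DataFrame({"age": [25, 40, 89], "educ": [12, 16, 8]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")
    # Attach the label "89 or older" to the numeric value 89 (pandas >= 1.4).
    df.to_stata(path, write_index=False, value_labels={"age": {89: "89 or older"}})

    # Default behaviour: labeled values are replaced by their labels.
    labeled = pd.read_stata(path)
    # Keep the underlying numeric values instead of the labels.
    raw = pd.read_stata(path, convert_categoricals=False)

print(labeled["age"].tolist())
print(raw["age"].tolist())
```

With the numeric read, the age column can go straight into the regression, although 89 then stands for "89 or older" rather than an exact age.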
An alternative second line of @Wouter's solution could be:
gss.loc[gss.age=='89 or older','age']='89'
See this discussion of replacing based on a condition for more details.
Of course, whether this replacement is appropriate depends on your use case.
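A minimal sketch of this replacement on made-up data, followed by the conversion to float that the regression still needs. The DataFrame here is a stand-in, not the real gss data. The astype(object) step is only there in case the column was read as a categorical, because assigning a value that is not an existing category raises an error.

```python
import pandas as pd

# Hypothetical stand-in for gss: age holds numbers plus the top-coded label.
gss = pd.DataFrame({"age": [25, 40, "89 or older"], "educ": [12, 16, 8]})

# Harmless for an object column; needed first if age is a categorical.
gss["age"] = gss["age"].astype(object)

# Top-code: treat "89 or older" as 89.
gss.loc[gss["age"] == "89 or older", "age"] = 89

# Convert to a numeric dtype that statsmodels accepts.
gss["age"] = gss["age"].astype(float)
print(gss["age"].tolist())  # [25.0, 40.0, 89.0]
```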