Problem Running OLS with Stata Data in Python

Question

I am having trouble running OLS in Python after reading in Stata data. My code and the error message are below.

import pandas as pd  # To read data
import numpy as np
import statsmodels.api as sm

gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
X = gss[['age', 'impinc' ]]
y = gss[['educ']]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())

The error message says:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

Any thoughts on how to run this simple OLS?

Answer 1

Your age variable contains the value "89 or older", so the whole column is read as text rather than numbers, which is not a valid input for statsmodels. You have to deal with this so the column can be read as integer or float, for example like this:

gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
# Drop the top-coded rows; .copy() avoids a SettingWithCopyWarning on the next line
gss = gss[gss.age != '89 or older'].copy()
gss['age'] = gss.age.astype(float)  # age is now numeric
X = gss[['age', 'impinc']]
y = gss[['educ']]
X = sm.add_constant(X)  # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())

P.S. I'm not saying that dropping observations where age == "89 or older" is the best approach; you'll have to decide how best to handle them. If you want a categorical variable in your model, you'll have to create dummies first.
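To illustrate the dummies remark, here is a minimal sketch using a small made-up DataFrame in place of gssSample.dta. The values, the age_group column and the cut points (39 and 88) are assumptions chosen only for illustration, not anything taken from the GSS file.

```python
import pandas as pd

# Hypothetical stand-in for the GSS sample; the real file is not reproduced here.
df = pd.DataFrame({
    "educ": [12, 16, 14, 10, 18, 12],
    "impinc": [20000.0, 55000.0, 40000.0, 15000.0, 70000.0, 30000.0],
    "age": ["25", "40", "89 or older", "33", "89 or older", "61"],
})

# Remember which rows are top-coded, then parse the rest as numbers
# (the top-coded rows become NaN).
is_top = df["age"] == "89 or older"
age_num = pd.to_numeric(df["age"], errors="coerce")

# Bin the numeric ages (arbitrary cut points) and keep the top code as its own group.
df["age_group"] = pd.cut(
    age_num, bins=[0, 39, 88], labels=["under 40", "40 to 88"]
).astype(object)
df.loc[is_top, "age_group"] = "89 or older"

# One dummy per group, dropping the first to avoid collinearity with the constant.
dummies = pd.get_dummies(df["age_group"], prefix="age", drop_first=True, dtype=float)
X = pd.concat([df[["impinc"]], dummies], axis=1)

print(X.dtypes)
```

Every column of X is now float, so it could be passed to sm.add_constant and sm.OLS the same way as in the code above.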

EDIT: If your .dta file contains a numeric variable with value labels, the value labels are used as the values by default, so the column holds label text instead of numbers. You can pass convert_categoricals=False to pd.read_stata to read in the underlying numeric values.
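A rough way to see the difference without the original file is to write a tiny labelled .dta file and read it back both ways. The data and the file name toy.dta are made up, and the sketch assumes pandas 1.4 or later, where to_stata accepts a value_labels argument.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: numeric ages where 89 carries the value label "89 or older".
df = pd.DataFrame({"age": [23, 45, 89, 67], "educ": [12, 16, 8, 14]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dta")
    # value_labels attaches Stata value labels to a non-categorical column.
    df.to_stata(path, write_index=False, value_labels={"age": {89: "89 or older"}})

    # Default behaviour: labelled values are replaced by their labels.
    labeled = pd.read_stata(path)
    # Keep the underlying numbers and ignore the labels.
    raw = pd.read_stata(path, convert_categoricals=False)

print(labeled["age"].tolist())
print(raw["age"].tolist())
```

With convert_categoricals=False the top-coded rows stay in the data as 89, so whether to keep, drop or recode them is still your call.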

Answer 2

An alternative second line of @Wouter's solution could be:

gss.loc[gss.age == '89 or older', 'age'] = '89'

See this discussion of replacing based on a condition for more details.

Of course, whether this replacement is appropriate depends on your use case.
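For completeness, a small sketch of the replacement followed by the numeric conversion, on made-up data. It assumes the age column may arrive either as plain strings or as a pandas categorical, so it goes through str first; the four sample values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the age column; a categorical is used here on purpose.
gss = pd.DataFrame({"age": pd.Categorical(["25", "40", "89 or older", "61"])})

# Going through str works for both categorical and object columns.
age = gss["age"].astype(str)
# Recode the top-coded label to 89 (a lower bound on the true age).
age = age.replace({"89 or older": "89"})
gss["age"] = age.astype(float)

print(gss["age"].tolist())
```

Keep in mind that the recoded rows understate the true age, which is one reason the replacement may or may not suit your analysis.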
