Problem Running OLS with Stata Data in Python

I am having problems running OLS in Python after reading in Stata data. Below are my code and the error message:

import pandas as pd  # To read data
import numpy as np
import statsmodels.api as sm

gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
X = gss[['age', 'impinc' ]]
y = gss[['educ']]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())

The error message says:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

Any thoughts on how to run this simple OLS?

Your age variable contains the value "89 or older", which causes the column to be read as text rather than numbers, and that is not valid input for statsmodels. You have to deal with this so the column can be read as integer or float, for example like this:

gss = pd.read_stata("gssSample.dta", preserve_dtypes=False)
gss = gss[gss.age != '89 or older'].copy()  # .copy() avoids SettingWithCopyWarning on the next line
gss['age'] = gss.age.astype(float)
X = gss[['age', 'impinc' ]]
y = gss[['educ']]
X = sm.add_constant(X) # adding a constant
model = sm.OLS(y, X).fit()
print(model.summary())

PS: I'm not saying that dropping observations where age == "89 or older" is the best way. You'll have to decide how best to deal with this. If you want a categorical variable in your model, you'll have to create dummies first.
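As a rough illustration of the dummy-variable route, here is a minimal sketch using pd.get_dummies. The toy data and the age_group column are made up for the example and are not part of the GSS file; how you bin age is up to you.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the data (not the real GSS sample)
df = pd.DataFrame({
    "educ": [12, 16, 14, 10],
    "impinc": [20000.0, 55000.0, 32000.0, 15000.0],
    "age_group": ["18-39", "40-64", "65+", "18-39"],  # assumed binning
})

# One 0/1 column per category. drop_first=True leaves out a reference
# category so the dummies are not collinear with the constant, and
# dtype=float keeps the columns numeric for statsmodels.
dummies = pd.get_dummies(df["age_group"], prefix="age", drop_first=True, dtype=float)

# Combine the numeric regressor with the dummy columns
X = pd.concat([df[["impinc"]], dummies], axis=1)
print(X)
```

The resulting X can then go through sm.add_constant and sm.OLS in the same way as in the code above.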

EDIT: If your .dta file contains a numeric variable with value labels, the value labels are used as values by default, which causes the labelled values to be read as text rather than numbers. You can pass convert_categoricals=False to pd.read_stata to read in the numeric values instead.
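To see the difference without the original file, the sketch below writes a tiny made-up .dta file with one value label and reads it back both ways. The file name, the toy data, and the label mapping are assumptions for illustration, and the value_labels argument of to_stata needs pandas 1.4 or newer.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data: age is numeric, but 89 carries a Stata value label
df = pd.DataFrame({"age": [25, 40, 89], "educ": [12, 16, 8]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dta")
    # Write the file with a value label attached to age == 89
    df.to_stata(path, write_index=False, value_labels={"age": {89: "89 or older"}})

    # Default behaviour: the label replaces the labelled value
    labelled = pd.read_stata(path)
    # convert_categoricals=False keeps the underlying numbers
    raw = pd.read_stata(path, convert_categoricals=False)

print(labelled["age"].tolist())
print(raw["age"].tolist())
```

With the numeric read, the age column can be passed to statsmodels directly, although 89 then stands for everyone aged 89 or older.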

An alternative to the second line of @Wouter's solution could be:

gss.loc[gss.age=='89 or older','age']='89'

See this discussion of replacing based on a condition for more details.

Of course, whether this replacement is appropriate depends on your use case.
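Note that after the replacement the column still holds strings, so it needs a numeric conversion before it goes into the model. Here is a small sketch on made-up data (the Series below is a stand-in, not the real age column) that first lists the values blocking the conversion and then recodes and converts:

```python
import pandas as pd

# Hypothetical stand-in for the age column, held as plain strings
age = pd.Series(["25", "40", "89 or older", "63"], name="age")

# Values that fail numeric conversion become NaN, which exposes the culprits
as_number = pd.to_numeric(age, errors="coerce")
print(age[as_number.isna()].unique())

# Recode the top-coded label (assumption: treating it as 89 is acceptable),
# then convert the whole column to float for statsmodels
age_clean = pd.to_numeric(age.replace("89 or older", "89")).astype(float)
print(age_clean.tolist())
```

Checking the non-convertible values first is a cheap way to confirm that "89 or older" is the only label in the way before deciding how to recode it.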
