
Using categorical variables in statsmodels OLS class

I want to use statsmodels OLS class to create a multiple regression model. Consider the following dataset:

import statsmodels.api as sm
import pandas as pd
import numpy as np

dict = {'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
        'debt_ratio': np.random.randn(5),
        'cash_flow': np.random.randn(5) + 90}

df = pd.DataFrame.from_dict(dict)

x = df[['debt_ratio', 'industry']]
y = df['cash_flow']

def reg_sm(x, y):
    x = np.array(x).T
    x = sm.add_constant(x)
    results = sm.OLS(endog = y, exog = x).fit()
    return results

When I run the following code:

reg_sm(x, y)

I get the following error:

TypeError: '>=' not supported between instances of 'float' and 'str'

I've tried converting the industry variable to categorical, but I still get an error. I'm out of options.

You're on the right path with converting to a Categorical dtype. However, once you convert the DataFrame to a NumPy array, you get an object dtype (NumPy arrays are one uniform type as a whole). This means that the individual values are still str under the hood, which a regression definitely is not going to like.
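You can see this upcasting directly (a minimal sketch; the column names are illustrative):

```python
import numpy as np
import pandas as pd

# A small frame mixing float and string columns, like debt_ratio + industry
mixed = pd.DataFrame({'debt_ratio': [0.5, 1.2],
                      'industry': ['mining', 'finance']})
arr = np.array(mixed)

print(arr.dtype)        # object -- one uniform dtype for the whole array
print(type(arr[0, 1]))  # <class 'str'> -- the values are still Python strings
```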

What you might want to do is to dummify this feature. Instead of factorizing it, which would effectively treat the variable as continuous, you want to maintain some semblance of categorization:

>>> import statsmodels.api as sm
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(444)
>>> data = {
...     'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
...    'debt_ratio':np.random.randn(5),
...    'cash_flow':np.random.randn(5) + 90
... }
>>> data = pd.DataFrame.from_dict(data)
>>> data = pd.concat((
...     data,
...     pd.get_dummies(data['industry'], drop_first=True)), axis=1)
>>> # You could also use data.drop('industry', axis=1)
>>> # in the call to pd.concat()
>>> data
         industry  debt_ratio  cash_flow  finance  hospitality  mining  transportation
0          mining    0.357440  88.856850        0            0       1               0
1  transportation    0.377538  89.457560        0            0       0               1
2     hospitality    1.382338  89.451292        0            1       0               0
3         finance    1.175549  90.208520        1            0       0               0
4   entertainment   -0.939276  90.212690        0            0       0               0

Now you have dtypes that statsmodels can work with better. The purpose of drop_first is to avoid the dummy trap:

>>> y = data['cash_flow']
>>> x = data.drop(['cash_flow', 'industry'], axis=1)
>>> sm.OLS(y, x).fit()
<statsmodels.regression.linear_model.RegressionResultsWrapper object at 0x115b87cf8>
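To make the dummy-trap point concrete: with all k indicator columns present, every row sums to 1, so the dummies are perfectly collinear with a constant term. drop_first=True keeps only k-1 of them (a sketch using the same industry levels):

```python
import pandas as pd

industries = pd.Series(['mining', 'transportation', 'hospitality',
                        'finance', 'entertainment'], name='industry')

full = pd.get_dummies(industries)                      # one column per level: 5
reduced = pd.get_dummies(industries, drop_first=True)  # first level dropped: 4

# The 5 full columns sum to 1 in every row, so together with an
# intercept they are perfectly collinear (the dummy trap).
print(full.shape[1], reduced.shape[1])  # 5 4
```

The dropped level ('entertainment' here, the first alphabetically) becomes the reference category that the other coefficients are measured against.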

Lastly, just a small pointer: it helps to avoid naming references with names that shadow built-in object types, such as dict.
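A quick illustration of why the shadowing bites:

```python
dict = {'industry': ['mining']}  # shadows the built-in dict type

try:
    dict(debt_ratio=0.5)         # the name now points at the data, not the type
    shadowed = False
except TypeError:
    shadowed = True              # "'dict' object is not callable"

del dict                         # the built-in is reachable again
```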

I had this problem as well, with lots of columns that needed to be treated as categorical, which makes dummifying quite annoying to deal with. And converting to string didn't work for me either.

For anyone looking for a solution without one-hot encoding the data, the R-style formula interface provides a nice way of doing this:

import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

data = {'industry': ['mining', 'transportation', 'hospitality', 'finance', 'entertainment'],
        'debt_ratio': np.random.randn(5),
        'cash_flow': np.random.randn(5) + 90}

df = pd.DataFrame.from_dict(data)

x = df[['debt_ratio', 'industry']]
y = df['cash_flow']

# NB: unlike sm.OLS, an intercept term is included here by default
smf.ols(formula="cash_flow ~ debt_ratio + C(industry)", data=df).fit()
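To see how the formula interface encodes the factor, you can inspect the fitted parameter names (a sketch with made-up data; two industry levels are used so that 'finance', the first alphabetically, becomes the reference level):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(444)
df = pd.DataFrame({
    'industry': ['mining', 'finance'] * 5,
    'debt_ratio': np.random.randn(10),
    'cash_flow': np.random.randn(10) + 90,
})

results = smf.ols('cash_flow ~ debt_ratio + C(industry)', data=df).fit()
# patsy treatment-codes industry, dropping the first level as the reference,
# so a 'C(industry)[T.mining]' coefficient appears alongside debt_ratio
print(results.params.index.tolist())
```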

Reference: https://www.statsmodels.org/stable/example_formulas.html#categorical-variables

Just another example of a similar case with categorical variables, which gives the correct result compared to a statistics course given in R (Hanken, Finland).

import wooldridge as woo
import statsmodels.formula.api as smf
import numpy as np

df = woo.dataWoo('beauty')
print(df.describe())

df['abvavg'] = (df['looks']>=4).astype(int) # good looking
df['belavg'] = (df['looks']<=2).astype(int) # bad looking

df_female = df[df['female']==1]
df_male = df[df['female']==0]

results_female = smf.ols(formula='np.log(wage) ~ belavg + abvavg', data=df_female).fit()
print(f"FEMALE results, summary \n {results_female.summary()}")

results_male = smf.ols(formula='np.log(wage) ~ belavg + abvavg', data=df_male).fit()
print(f"MALE results, summary \n {results_male.summary()}")

Terveisin (Regards), Markus
